These benchmark tests demonstrate the performance of the LuaJIT compiler, the LuaJIT interpreter, and Lua 5.1.
LuaJIT states that globals and locals now have the same performance, unlike in plain Lua.
LuaJIT claims to be faster than Lua; even the Lua project suggests using LuaJIT for more performance.
LuaJIT uses its own interpreter, its own compiler, and many other optimizations to improve performance. But is it really fast?
About tests
This site contains results and conclusions for the LuaJIT compiler, the LuaJIT interpreter, and Lua 5.1.4.
The LuaJIT interpreter is included because its numbers are useful for functions you are 100% sure will never compile.
Or maybe you're using an embedded LuaJIT 2.0 that aborts a trace on any C function (and FFI is disabled).
Lua 5.1 is included to help you decide what to choose, or just out of curiosity.
The first 14 benchmark tests were taken from this page. New benchmark tests are welcome.
Specs: Intel i5-6500 3.20 GHz. 64-bit. LuaJIT 2.1.0-beta3. (Lua 5.1.4 for plain Lua tests) (LuaJIT 2.0.4 for LuaJIT 2.0 assembler tests)
(JIT: ON SSE2 SSE3 SSE4.1 BMI2 fold cse dce fwd dse narrow loop abc sink fuse)
Benchmark Code
Source code
For benchmark tests we use the median of 100 takes of the given number of iterations of the code.
local takes = {}
for take = 1, 100 do
local START = os.clock()
for times = 1, iterations do
...
end
local END = os.clock()
takes[take] = END - START
end
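The median of a list of timing samples can be computed with a small helper; a minimal sketch (the `median` name and the sample values are mine, not from the benchmark source):

```lua
-- Hypothetical helper: median of a list of timing samples.
local function median(t)
  local s = {}
  for i = 1, #t do s[i] = t[i] end  -- copy so the input stays unsorted
  table.sort(s)
  local n = #s
  if n % 2 == 1 then
    return s[(n + 1) / 2]
  else
    return (s[n / 2] + s[n / 2 + 1]) / 2
  end
end

print(median({3, 1, 2}))     -- 2
print(median({4, 1, 3, 2}))  -- 2.5
```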
For assembler tests we use luajit -jdump=+Arsa asmbench.lua.
The total number of instructions is based on the maximum possible count (the last jump or RET).
Bytecode size is taken from -jdump, not -bl, so it also counts sub-functions' instructions and headers.
Script for bytecode test.
Each global lookup can cost around 11 instructions. Both versions run at almost the same speed, but this benchmark tests only one global.
It is still good practice to localize the variables you need.
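A sketch of what "localize" means here (hypothetical loop, not the benchmark code itself):

```lua
-- Cache the global lookup once, outside the hot loop.
local floor = math.floor  -- one global + one table lookup, paid once
local sum = 0
for i = 1, 1000 do
  sum = sum + floor(i / 2)  -- plain local access on every iteration
end
print(sum)  -- 250000
```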
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Global
0.24571 sec(s)
0.23929
0.29617
0.24856
(102.83%)
2
Local
0.23894 sec(s)
0.22918
0.32741
0.24434
(100%)
Conclusion:
The JIT compiler gives globals and locals the same performance. It is still good practice to localize the variables you need; upvalues are faster still.
As the first test showed, each table indexing can cost around 11 additional instructions. Localizing the math table itself won't help much; localize the exact values you need.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Global table indexing
0.36948 sec(s)
0.36357
0.3908
0.37024
(146.27%)
2
Local
0.25259 sec(s)
0.24956
0.26611
0.25344
(100%)
Conclusion:
Localizing the exact value will get you more performance.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Global table indexing
0.9335 sec(s)
0.884
1.039
0.93893
(120.84%)
2
Local
0.7725 sec(s)
0.733
0.889
0.77743
(100%)
Conclusion:
Localizing the exact value will get you more performance.
LuaJIT compiles both variants with the same performance.
However, LuaJIT's advice is not to second-guess the JIT compiler, because unnecessary localization can produce more complicated code.
Localizing local c = a + b for z = x[a+b] + y[a+b] is redundant; the JIT compiles code such as a[i][j] = a[i][j] * a[i][j+1] perfectly well.
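A sketch of that point (values are mine): both forms compute the same result, and the JIT's common-subexpression elimination already shares the repeated a + b, so the manual local buys nothing in compiled code.

```lua
local x, y = {10, 20}, {30, 40}
local a, b = 1, 1

-- manual localization of the repeated index
local c = a + b
local z1 = x[c] + y[c]

-- direct form; the compiler folds the duplicate a + b by itself
local z2 = x[a + b] + y[a + b]

print(z1, z2)  -- 60  60
```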
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Direct call
0.5611 sec(s)
0.55
0.6099
0.56474
(120.64%)
2
Localized call
0.46508 sec(s)
0.4563
0.5834
0.46996
(100%)
Conclusion:
Unlike the JIT compiler, the LuaJIT interpreter still runs faster with localized functions thanks to a simple MOV instruction.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Direct call
1.6065 sec(s)
1.516
1.843
1.62501
(120.38%)
2
Localized call
1.3345 sec(s)
1.297
1.647
1.35458
(100%)
Conclusion:
A localized function speeds up the code thanks to a simple MOV instruction.
Avoid using unpack for small tables with a known size. As an alternative, you can use this function:
do
local concat = table.concat
local loadstring = loadstring
function createunpack(n)
local ret = {"local t = ... return "}
for k = 1, n do
ret[2 + (k-1) * 4] = "t["
ret[3 + (k-1) * 4] = k
ret[4 + (k-1) * 4] = "]"
if k ~= n then ret[5 + (k-1) * 4] = "," end
end
return loadstring(concat(ret))
end
end
This function has one limitation: the maximum number of returned values is 248, while the limit of LuaJIT's own unpack is 7999 with default settings.
On the upside, createunpack can create a JIT-compiled unpack (unpack4 is simply createunpack(4)).
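A standalone usage sketch of the generated-unpack idea (a self-contained copy of the generator, so it runs on its own; loadstring is the Lua 5.1/LuaJIT name, load on 5.2+):

```lua
-- Builds "local t = ... return t[1],...,t[n]" and compiles it.
local function createunpack(n)
  local ret = {"local t = ... return "}
  for k = 1, n do
    local base = 2 + (k - 1) * 4
    ret[base], ret[base + 1], ret[base + 2] = "t[", k, "]"
    if k ~= n then ret[base + 3] = "," end
  end
  return loadstring(table.concat(ret))
end

local unpack4 = createunpack(4)
print(unpack4({10, 20, 30, 40}))  -- prints 10, 20, 30, 40 (tab-separated)
```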
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Indexing
3.73678 sec(s)
3.60006
4.61773
3.78408
(100%)
2
unpack
5.56231 sec(s)
5.12473
6.89518
5.69063
(148.85%)
3
unpack4
4.17394 sec(s)
4.12066
4.90567
4.34065
(111.69%)
Conclusion:
Avoid using unpack for small tables with a known size. As an alternative, you can use the function mentioned on the LuaJIT tab.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Indexing
12.272 sec(s)
11.929
14.207
12.38047
(115.97%)
2
unpack
10.5815 sec(s)
10
11.586
10.56572
(100%)
3
unpack4
14.855 sec(s)
14.491
18.836
15.07444
(140.38%)
Conclusion:
Any method is OK; unpack4 is the slowest, probably because of the function call overhead.
If possible, localize your function and re-use it. If you need to provide a local to the closure, try a different approach to passing values; a simple example is turning a stateful iterator into a stateless one.
Example of a different way of passing values:
function func()
local a, b = 50, 10
timer.Simple(5, function()
print(a + b)
end)
end
In this example timer.Simple can't pass arguments to the function, so we change the style of value passing from function upvalues to main-chunk upvalues:
local Ua, Ub
local function printAplusB()
print(Ua + Ub)
end
function func()
local a, b = 50, 10
Ua, Ub = a, b
timer.Simple(5, printAplusB)
end
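The stateful-to-stateless iterator idea mentioned above can be sketched like this (hypothetical iterator; it receives its state as arguments, so no closure is allocated per traversal):

```lua
-- Stateless iterator in the same shape ipairs uses: f(state, control).
local function iter(t, i)
  i = i + 1
  if t[i] ~= nil then
    return i, t[i]
  end
end

local t = {"a", "b", "c"}
for i, v in iter, t, 0 do  -- no closure created here
  print(i, v)
end
```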
a[0] or a.n is the best solution you can use. (If you have table.pack, remember that it creates a sequential table and sets n to the size of the created table; this can be used for iteration.)
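A sketch of the table.pack note (assumes table.pack is available: Lua 5.2+, or LuaJIT built with LUAJIT_ENABLE_LUA52COMPAT):

```lua
local t = table.pack("a", "b", "c")
print(t.n)  -- 3; usable directly as the loop bound
for i = 1, t.n do
  print(t[i])
end
```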
JIT-compiled pairs is still slow, but it does compile.
The results of this test for the LuaJIT interpreter are confusing.
They were verified many times. The current goal is to email Mike Pall about these results and ask why they are so different.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
pairs
0.51711 sec(s)
0.48241
0.67666
0.5224
(100%)
2
JITed pairs
1.80467 sec(s)
1.62461
2.02158
1.77821
(348.99%) (3 times slower)
3
ipairs
1.70326 sec(s)
1.64163
2.1924
1.72125
(329.38%) (3 times slower)
4
Known length
0.67382 sec(s)
0.6603
0.85948
0.68079
(130.30%)
5
#a
0.6967 sec(s)
0.68416
0.74215
0.70065
(134.72%)
6
Upvalued length
0.67209 sec(s)
0.6611
0.77354
0.67794
(129.97%)
7
a.n
0.69201 sec(s)
0.66747
1.00413
0.7115
(133.82%)
8
a[0]
0.6715 sec(s)
0.66014
0.77048
0.67611
(129.85%)
Conclusion:
These results require an explanation; no conclusion can be drawn.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
pairs
3.5325 sec(s)
3.241
3.968
3.51657
(193.03%)
2
ipairs
3.226 sec(s)
3.059
4.155
3.24595
(176.28%)
3
Known length
1.83 sec(s)
1.753
2.005
1.83169
(100%)
4
#a
1.8305 sec(s)
1.755
2.114
1.84612
(100.02%)
5
Upvalued length
1.8775 sec(s)
1.794
2.452
1.9197
(102.59%)
6
a.n
1.8815 sec(s)
1.773
2.248
1.89361
(102.81%)
7
a[0]
1.841 sec(s)
1.779
2.197
1.86906
(100.60%)
Conclusion:
a[0] and a.n are as fast as in compiled LuaJIT.
11. Localizing a table value for multiple uses
Predefines:
local a = {}
for i = 1, 100 do
a[i] = {
x = 10
}
end
Code 1:
for n = 1, 100 do
a[n].x = a[n].x + 1
end
Code 2:
local a = a
for n = 1, 100 do
local y = a[n]
y.x = y.x + 1
end
You may localize your values for the interpreter.
However, LuaJIT's advice is not to second-guess the JIT compiler: in compiled code, locals and upvalues are accessed directly through their reference pointers, so over-localization may complicate the compiled code.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
No localization
19.37724 sec(s)
19.18286
21.82537
19.48628
(137.26%)
2
Localized a and a[n]
14.11709 sec(s)
13.8045
17.69585
14.28717
(100%)
Conclusion:
If your code can't be compiled, localization is the best you can do here.
local a = {
[0] = 0,
n = 0
}
local tinsert = table.insert
local count = 1
-- Note: after each run of the code the table and count variable are restored to predefined state.
-- If you don't clean them after a test, table.insert will be super slow.
Code 1:
tinsert(a, times)
Code 2:
a[times] = times
Code 3:
a[#a + 1] = times
Code 4:
a[count] = times
count = count + 1
Code 5:
a.n = a.n + 1
a[a.n] = times
Code 6:
a[0] = a[0] + 1
a[a[0]] = times
Results (1M iterations):
Assembler Results:
tinsert: 65 instructions total.
a[times]: 62 instructions total.
a[#a + 1]: 72 instructions total.
a[count]: 78 instructions total.
a[a.n]: ~52
a[a[0]]: ~51
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
tinsert and a[#a + 1]
0.09972 sec(s)
0.09614
0.16774
0.10205
(1673.15%) (16 times slower)
2
a[times]
0.00596 sec(s)
0.00507
0.01528
0.00629
(100%)
3
a[count]
0.00655 sec(s)
0.00599
0.00806
0.00657
(109.89%)
4
a[a.n]
0.00689 sec(s)
0.006
0.00865
0.00696
(115.6%)
5
a[a[0]]
0.00833 sec(s)
0.00751
0.01167
0.00844
(139.76%)
Conclusion:
Using a local or a constant value is the fastest method.
If that's not possible, use an external counter; otherwise use a.n = a.n + 1; a[a.n] = times or a[#a + 1].
The instruction counts may be inaccurate due to my limited knowledge of assembler.
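The external-counter pattern from the conclusion, spelled out (hypothetical five-element fill):

```lua
local a, count = {}, 0
for times = 1, 5 do
  count = count + 1  -- external counter instead of #a or table.insert
  a[count] = times * 10
end
print(a[1], a[5])  -- 10  50
```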
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
tinsert
0.1522 sec(s)
0.14448
0.21487
0.15571
(112.44%)
2
a[times]
0.01899 sec(s)
0.01791
0.03054
0.0194
(14.03%) (7 times faster)
3
a[#a + 1]
0.13535 sec(s)
0.12965
0.17014
0.13644
(100%)
4
a[count]
0.0277 sec(s)
0.02617
0.03003
0.02779
(20.46%) (4 times faster)
5
a[a.n]
0.0368 sec(s)
0.03462
0.057
0.03752
(27.18%) (3 times faster)
6
a[a[0]]
0.0335 sec(s)
0.03114
0.04102
0.03386
(24.75%) (4 times faster)
Conclusion:
Please note that the percentages in this table are computed against a different baseline row than in the compiled results.
Using a local or a constant value is the fastest method. If that's not possible, use an external counter; otherwise use a.n = a.n + 1; a[a.n] = times or a[#a + 1].
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
tinsert
0.134 sec(s)
0.128
0.165
0.13653
(103.07%)
2
a[times]
0.06 sec(s)
0.057
0.066
0.06042
(46.15%) (2 times faster)
3
a[#a + 1]
0.13 sec(s)
0.125
0.162
0.13142
(100%)
4
a[count]
0.075 sec(s)
0.069
0.108
0.07713
(57.69%)
5
a[a.n]
0.188 sec(s)
0.179
0.245
0.19067
(144.61%)
6
a[a[0]]
0.255 sec(s)
0.246
0.292
0.25796
(196.15%)
Conclusion:
Please note that the percentages in this table are computed against a different baseline row than in the compiled results.
Using a local or a constant value is the fastest method. If that's not possible, use an external counter; otherwise use a.n = a.n + 1; a[a.n] = times or a[#a + 1].
13. Table with and without pre-allocated size
Predefines:
local a
require("table.new")
local new = table.new
local ffi = require("ffi")
local ffinew = ffi.new
Code 1:
local a = {}
a[1] = 1
a[2] = 2
a[3] = 3
Code 2:
local a = {true, true, true}
a[1] = 1
a[2] = 2
a[3] = 3
Code 3 (table.new is available since LuaJIT v2.1.0-beta1):
local a = new(3,0)
a[1] = 1
a[2] = 2
a[3] = 3
Code 4:
local a = {1, 2, 3}
Code 5 (FFI):
local a = ffinew("int[3]", 1, 2, 3)
Code 6 (FFI):
local a = ffinew("int[3]")
a[0] = 1
a[1] = 2
a[2] = 3
(FFI) Defined in constructor: 18 instructions total.
(FFI) Defined after: 18 instructions total.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Allocated on demand
1.26337 sec(s)
1.24794
1.53786
1.2751
(39480.31%) (394 times slower)
2
Pre-allocated with dummy values
0.0032 sec(s)
0.00312
0.00358
0.00322
(100%)
3
Pre-allocated by table.new
0.41859 sec(s)
0.4055
0.49486
0.42476
(13080.93%) (130 times slower)
4
Defined in constructor
0.00325 sec(s)
0.00306
0.00411
0.00329
(101.56%)
5
(FFI) Defined in constructor
0.00325 sec(s)
0.0031
0.00425
0.00331
(101.56%)
6
(FFI) Defined after
0.00339 sec(s)
0.00312
0.00463
0.00351
(105.93%)
Conclusion:
Pre-allocation will speed up your code if you need more speed.
In 50% of cases tables are used without pre-allocated space, so it's OK to allocate them on demand.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Allocated on demand
1.73737 sec(s)
1.7137
1.90643
1.74489
(310.27%) (3 times slower)
2
Pre-allocated with dummy values
0.61846 sec(s)
0.61472
0.6396
0.61924
(110.44%)
3
Pre-allocated by table.new
0.86155 sec(s)
0.81076
1.41348
0.86788
(153.86%)
4
Defined in constructor
0.55995 sec(s)
0.53821
0.63602
0.56426
(100%)
5
(FFI) Defined in constructor
3.09061 sec(s)
2.94983
3.91517
3.18377
(551.94%) (5 times slower)
6
(FFI) Defined after
4.46811 sec(s)
4.18024
5.32326
4.61457
(797.94%) (7 times slower)
Conclusion:
Pre-allocation will speed up your code if you need more speed.
In 50% of cases tables are used without pre-allocated space, so it's OK to allocate them on demand.
If you don't need an FFI array, don't use one for CPU optimization (only for saving RAM).
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Allocated on demand
5.304 sec(s)
5.243
5.694
5.32726
(196.88%)
2
Pre-allocated with dummy values
2.863 sec(s)
2.676
3.763
2.9231
(106.27%)
3
Defined in constructor
2.694 sec(s)
2.303
3.364
2.65954
(100%)
Conclusion:
Pre-allocation will speed up your code if you need more speed.
In 50% of cases tables are used without pre-allocated space, so it's OK to allocate them on demand.
14. Table initialization before or each time on insertion
Predefines:
local T = {}
local CachedTable = {"abc", "def", "ghk"}
local text = "Hello, this is an example text"
local cstring = ffi.cast("const char*", text)
local char = string.char
local sub, gsub, gmatch = string.sub, string.gsub, string.gmatch
local gsubfunc = function(s)
local x = s
end
Code 1:
for i = 1, #text do
local x = sub(text, i, i)
end
Code 2:
for k in gmatch(text, ".") do
local x = k
end
Code 3:
gsub(text, ".", gsubfunc)
Code 4 (FFI):
for i = 0, #text - 1 do
local x = char(cstring[i])
end
If you're using FFI on LuaJIT 2.1.0 or higher, splitting will be the fastest.
You probably won't need to split at all: FFI arrays are mutable, so all text manipulations can be done in place. Otherwise use string.sub.
It's recommended to use string.find, string.match, etc. where possible; splitting into single characters wastes GC time.
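A sketch of the string.find/string.match advice (the "%a+" pattern and the word-collecting loop are mine):

```lua
-- Pull whole tokens with gmatch instead of splitting into single characters.
local text = "Hello, this is an example text"
local words = {}
for word in text:gmatch("%a+") do
  words[#words + 1] = word
end
print(#words)  -- 6
```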
Please note that the percentages in this table are computed against a different baseline row than in the other results.
This is an example of LuaJIT failing to optimize and compile code efficiently: the loop wasn't unrolled properly.
LuaJIT's advice is to find a balance between loops and unrolls and to use templates.
table.concat is the best solution in complicated code; however, where possible, inline your concats or unroll the loops.
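A minimal table.concat sketch (sample strings are mine): the pieces are collected in a table and joined once, instead of allocating an intermediate string on every `..`.

```lua
local parts = {}
for i = 1, 4 do
  parts[i] = "item" .. i  -- collect pieces, no intermediate concatenations
end
local s = table.concat(parts, ", ")
print(s)  -- item1, item2, item3, item4
```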
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Inline concat
1.44256 sec(s)
1.42674
1.76183
1.46447
(100%)
2
Separate concat
5.82289 sec(s)
5.44671
7.76331
5.9645
(403.64%) (4 times slower)
3
Loop concat
6.61971 sec(s)
5.70944
7.64707
6.6218
(458.88%) (4 times slower)
4
table.concat
1.49022 sec(s)
1.41849
1.95012
1.56112
(103.30%)
5
string.format
1.46481 sec(s)
1.42773
2.05097
1.52796
(101.54%)
Conclusion:
If possible, inline your concats; otherwise use table.concat.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Inline concat
1.023 sec(s)
1.01
1.296
1.04552
(100%)
2
Separate concat
10.445 sec(s)
9.918
12.909
10.63149
(1021.01%) (10 times slower)
3
Loop concat
11.723 sec(s)
9.919
14.472
11.64345
(1145.94%) (11 times slower)
4
table.concat
2.151 sec(s)
2.083
2.378
2.16366
(210.26%) (2 times slower)
5
string.format
2.179 sec(s)
2.116
3.099
2.26572
(213%) (2 times slower)
Conclusion:
If possible, inline your concats; otherwise use table.concat.
local TYPE_bool = "boolean"
local type = type
local function isbool1(b)
return type(b) == "boolean"
end
local function isbool2(b)
return type(b) == TYPE_bool
end
Code 1:
isbool1(false)
Code 2:
isbool2(false)
Results (10M iterations):
Assembler Results:
KGC string: 18 instructions total.
Upvalued string: 18 instructions total.
Conclusion:
LuaJIT compiles them with the same performance.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
KGC string
0.39173 sec(s)
0.37698
0.63159
0.41579
(100%)
2
Upvalued string
0.40781 sec(s)
0.3934
0.51813
0.4151
(104.10%)
Conclusion:
If possible, use literal strings inside the function.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
KGC string
1.324 sec(s)
1.26
1.99
1.37005
(100%)
2
Upvalued string
1.3915 sec(s)
1.268
1.773
1.40522
(105.09%)
Conclusion:
If possible, use literal strings inside the function.
20. Taking a value from a function with multiple returns🔗︎
Predefines:
local function funcmret()
return 1, 2
end
local select = select
Code 1:
local _, arg2 = funcmret()
return arg2
Code 2:
local arg2 = select(2, funcmret())
return arg2
Results (10M iterations):
Assembler Results:
With dummy variables: 18 instructions total.
select: 18 instructions total.
Conclusion:
LuaJIT compiles them with the same performance.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
With dummy variables
0.25193 sec(s)
0.24568
0.27575
0.25267
(100%)
2
select
0.38455 sec(s)
0.37498
0.4397
0.38579
(152.63%)
Conclusion:
select makes little sense for functions returning fewer than roughly 10 values: all returned values are pushed onto the stack anyway, and any value you choose can be picked up individually.
Tip: if you need only the first return value, wrap the function call in parentheses.
print( (math.frexp(0)) )
This will print only the first value.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
With dummy variables
0.611 sec(s)
0.6
0.702
0.61562
(100%)
2
select
0.813 sec(s)
0.786
0.926
0.81984
(133.06%)
Conclusion:
select makes little sense for functions returning fewer than roughly 10 values: all returned values are pushed onto the stack anyway, and any value you choose can be picked up individually.
Tip: if you need only the first return value, wrap the function call in parentheses.