Third edition is in progress.
LuaJIT Benchmarks
About
These benchmark tests compare the performance of the LuaJIT compiler, the LuaJIT interpreter, and Lua 5.1. LuaJIT states that globals and locals now have the same performance, unlike in plain Lua, and that LuaJIT is faster than Lua; even Lua suggests using LuaJIT for more performance. LuaJIT uses its own interpreter and compiler and many other optimizations to improve performance. But is it really fast?
About tests
This site contains results and conclusions for the LuaJIT compiler, the LuaJIT interpreter, and Lua 5.1.4. The LuaJIT interpreter is included because its numbers are useful for functions you are 100% sure will never be compiled, or if you're using an embedded LuaJIT 2.0 which aborts on any C function (and FFI is disabled). Lua 5.1 is included to help you decide what to use, or just out of curiosity. The first 14 benchmark tests were taken from this page. New benchmark tests are welcome.
Specs: Intel i5-6500 3.20 GHz, 64-bit. LuaJIT 2.1.0-beta3 (Lua 5.1.4 for plain Lua tests; LuaJIT 2.0.4 for LuaJIT 2.0 assembler tests). JIT: ON SSE2 SSE3 SSE4.1 BMI2 fold cse dce fwd dse narrow loop abc sink fuse.
Benchmark Code
Source code. For benchmark tests we use the median of 100 takes of the given number of iterations of the code:
for take = 1, 100 do
    local START = os.clock()
    for times = 1, iterations do
        ...
    end
    local END = os.clock()
end
For assembler tests we use luajit -jdump=+Arsa asmbench.lua. The total instruction count is based on the maximum possible amount (up to the last jump or RET). Bytecode size is taken from -jdump, not -bl, so it also counts sub-function instructions and headers. Script for bytecode test.
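As a rough, self-contained sketch of such a harness (the bench name, the callback argument, and the median calculation here are illustrative, not the site's actual script; calling the code through a function adds overhead that the real tests avoid by inlining it):

local function bench(iterations, fn)
    local takes = {}
    for take = 1, 100 do
        local START = os.clock()
        for times = 1, iterations do
            fn(times)
        end
        local END = os.clock()
        takes[take] = END - START
    end
    table.sort(takes)
    -- median of 100 takes: mean of the two middle values
    return (takes[50] + takes[51]) / 2
end

print(bench(1e7, function(times) local x = times % 30 end))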
Useful links
Things which are likely to cause NYI aborts from the JIT compiler
Tips for writing performant Lua code
LuaJIT 2.0 Bytecode reference
LuaJIT official site
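To see why a particular piece of code doesn't compile, you can also run LuaJIT's verbose mode, which reports trace aborts and their reasons (the script name is a placeholder):

luajit -jv yourscript.lua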
Contents
1. Local vs Global
2. Local vs Global table indexing
3. Localized method (3 calls)
4. Unpack
5. Find and return maximum value
6. "not a" vs "a or b"
7. "x ^ 2" vs "x * x" vs "math.pow"
8. "math.fmod" vs "%" operator
9. Predefined function or anonymous function in the argument
10. for loops
11. Localizing table value for multiple usage
12. Array insertion
13. Table with and without pre-allocated size
14. Table initialization before or each time on insertion
15. String split (by character)
16. Empty string check
17. C array size (FFI)
18. String concatenation
19. String in a function
20. Taking a value from a function with multiple returns
1. Local vs Global
Predefines:
local t = type
Code 1:
type(3)
Code 2:
t(3)
Results (10M iterations):
Assembler Results:
  1. Global: 29 instructions total.
  2. Local: 18 instructions total.
    mov dword [0x3e2d0410], 0x1
    movsd xmm7, [rdx+0x40]
    cvttsd2si eax, xmm7
    xorps xmm6, xmm6
    cvtsi2sd xmm6, eax
    ucomisd xmm7, xmm6
    jnz 0x7ffa543c0010 ->0
    jpe 0x7ffa543c0010 ->0
    cmp eax, 0x7ffffffe
    jg 0x7ffa543c0010 ->0
    cvttsd2si edi, [rdx+0x38]
    cmp dword [rdx+0x14], -0x09
    jnz 0x7ffa543c0010 ->0
    cmp dword [rdx+0x10], 0x3e2d8228
    - jnz 0x7ffa543c0010 ->0
    - mov edx, [0x3e2d8230]
    - cmp dword [rdx+0x1c], +0x3f
    - jnz 0x7ffa543c0010 ->0
    - mov ecx, [rdx+0x14]
    - mov rsi, 0xfffffffb3e2d2f88
    - cmp rsi, [rcx+0x5a8]
    - jnz 0x7ffa543c0010 ->0
    - cmp dword [rcx+0x5a4], -0x09
    - jnz 0x7ffa543c0010 ->0
    - cmp dword [rcx+0x5a0], 0x3e2d2ef0
    jnz 0x7ffa543c0010 ->0
    add edi, +0x01
    cmp edi, eax
    jg 0x7ffa543c0014 ->1
Conclusion:
Each global lookup can cost around 11 instructions. Both versions run at almost the same speed, but this benchmark tests only one global. It is still good practice to localize all the variables you need.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Global | 0.24571 sec(s) | 0.23929 | 0.29617 | 0.24856 | 102.83%
2 | Local | 0.23894 sec(s) | 0.22918 | 0.32741 | 0.24434 | 100%
Conclusion:
The JIT compiler matches the performance of globals and upvalues. It is still good practice to localize all the variables you need; upvalues are still faster.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Global | 0.9605 sec(s) | 0.937 | 1.075 | 0.97355 | 111.42%
2 | Local | 0.862 sec(s) | 0.845 | 0.939 | 0.86418 | 100%
Conclusion:
Upvalues are faster than globals.
2. Local vs Global table indexing
Predefines:
local s = math.sin
Code 1:
math.sin(3.14)
Code 2:
s(3.14)
Results (10M iterations):
Assembler Results:
  1. Global table indexing: 38 instructions total.
  2. Local: 18 instructions total.
    mov dword [0x24660410], 0x1
    movsd xmm7, [rdx+0x40]
    cvttsd2si eax, xmm7
    xorps xmm6, xmm6
    cvtsi2sd xmm6, eax
    ucomisd xmm7, xmm6
    jnz 0x7ffa543c0010 ->0
    jpe 0x7ffa543c0010 ->0
    cmp eax, 0x7ffffffe
    jg 0x7ffa543c0010 ->0
    cvttsd2si edi, [rdx+0x38]
    cmp dword [rdx+0x14], -0x09
    - jnz 0x7ffa543c0010 ->0
    - cmp dword [rdx+0x10], 0x2467f788
    - jnz 0x7ffa543c0010 ->0
    - mov ebp, [0x2467f790]
    - cmp dword [rbp+0x1c], +0x3f
    - jnz 0x7ffa543c0010 ->0
    - mov ebx, [rbp+0x14]
    - mov rsi, 0xfffffffb24665fd8
    - cmp rsi, [rbx+0x518]
    - jnz 0x7ffa543c0010 ->0
    - cmp dword [rbx+0x514], -0x0c
    jnz 0x7ffa543c0010 ->0
    + cmp dword [rdx+0x18], 0x2466aca0
    - mov edx, [rbx+0x510]
    - cmp dword [rdx+0x1c], +0x1f
    - jnz 0x7ffa543c0010 ->0
    - mov ecx, [rdx+0x14]
    - mov rsi, 0xfffffffb24666548
    - cmp rsi, [rcx+0x230]
    - jnz 0x7ffa543c0010 ->0
    - cmp dword [rcx+0x22c], -0x09
    - jnz 0x7ffa543c0010 ->0
    - cmp dword [rcx+0x228], 0x24666520
    jnz 0x7ffa543c0010 ->0
    add edi, +0x01
    cmp edi, eax
    jg 0x7ffa543c0014 ->1
Conclusion:
As the first test concluded, each table indexing can cost around 11 additional instructions, so localizing the math table alone won't help much. Localize the exact variables you need.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Global table indexing | 0.36948 sec(s) | 0.36357 | 0.3908 | 0.37024 | 146.27%
2 | Local | 0.25259 sec(s) | 0.24956 | 0.26611 | 0.25344 | 100%
Conclusion:
Localizing the exact value will get you more performance.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Global table indexing | 0.9335 sec(s) | 0.884 | 1.039 | 0.93893 | 120.84%
2 | Local | 0.7725 sec(s) | 0.733 | 0.889 | 0.77743 | 100%
Conclusion:
Localizing the exact value will get you more performance.
3. Localized method (3 calls)
Predefines:
local class = { test = function() return 1 end }
Code 1:
class.test() class.test() class.test()
Code 2:
local test = class.test test() test() test()
Results (10M iterations):
Assembler Results:
  1. Direct call: 35 instructions total.
  2. Localized call: 35 instructions total.
Conclusion:
LuaJIT compiles them with the same performance. However, LuaJIT suggests not second-guessing the JIT compiler, because unnecessary localization can create more complicated code. Localizing local c = a+b for z = x[a+b] + y[a+b] is redundant: the JIT perfectly compiles code such as a[i][j] = a[i][j] * a[i][j+1].
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Direct call | 0.5611 sec(s) | 0.55 | 0.6099 | 0.56474 | 120.64%
2 | Localized call | 0.46508 sec(s) | 0.4563 | 0.5834 | 0.46996 | 100%
Conclusion:
Unlike the JIT compiler, the LuaJIT interpreter still runs faster with localized functions due to the cheap MOV instruction.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Direct call | 1.6065 sec(s) | 1.516 | 1.843 | 1.62501 | 120.38%
2 | Localized call | 1.3345 sec(s) | 1.297 | 1.647 | 1.35458 | 100%
Conclusion:
A localized function speeds up the code due to the cheap MOV instruction.
4. Unpack
Predefines:
local min = math.min
local unpack = unpack
local a = {100, 200, 300, 400}
local function unpack4(a)
    return a[1], a[2], a[3], a[4]
end
Code 1:
min(a[1], a[2], a[3], a[4])
Code 2:
min(unpack(a))
Code 3:
min(unpack4(a))
Results (100M iterations):
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Indexing and unpack4 | 0.03403 sec(s) | 0.0316 | 0.05607 | 0.03531 | 100%
2 | unpack | 4.71555 sec(s) | 4.54112 | 5.6054 | 4.75909 | 13858.80% (138 times slower)
Assembler Results:
  1. Indexing: 36 instructions total.
  2. unpack: 46 instructions total. NYI on LuaJIT 2.0; falls back to the interpreter on LuaJIT 2.1.
  3. unpack4: 36 instructions total.
Conclusion:
Avoid using unpack for small tables of known size. As an alternative you can use this function:
do
    local concat = table.concat
    local loadstring = loadstring
    function createunpack(n)
        local ret = {"local t = ... return "}
        for k = 1, n do
            ret[2 + (k-1) * 4] = "t["
            ret[3 + (k-1) * 4] = k
            ret[4 + (k-1) * 4] = "]"
            if k ~= n then
                ret[5 + (k-1) * 4] = ","
            end
        end
        return loadstring(concat(ret))
    end
end
This function has one limitation: the maximum number of returned values is 248, while the limit of LuaJIT's unpack function is 7999 with default settings. At least createunpack can create a JIT-compiled unpack (unpack4 is basically createunpack(4)).
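For example, a quick sanity check (assuming createunpack from above is in scope):

local unpack4 = createunpack(4)
print(unpack4({100, 200, 300, 400})) --> 100 200 300 400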
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Indexing | 3.73678 sec(s) | 3.60006 | 4.61773 | 3.78408 | 100%
2 | unpack | 5.56231 sec(s) | 5.12473 | 6.89518 | 5.69063 | 148.85%
3 | unpack4 | 4.17394 sec(s) | 4.12066 | 4.90567 | 4.34065 | 111.69%
Conclusion:
Avoid using unpack for small tables of known size. As an alternative you can use the function mentioned on the LuaJIT tab.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Indexing | 12.272 sec(s) | 11.929 | 14.207 | 12.38047 | 115.97%
2 | unpack | 10.5815 sec(s) | 10 | 11.586 | 10.56572 | 100%
3 | unpack4 | 14.855 sec(s) | 14.491 | 18.836 | 15.07444 | 140.38%
Conclusion:
Any method is OK; unpack4 is the slowest, probably because of the function call overhead.
5. Find and return maximum value
Predefines:
local max = math.max
local num = 100
local y = 0
Code 1:
local x = max(num, y)
Code 2:
if (num > y) then local x = num end
Code 3:
local x = num > y and num or x
Results (10M iterations):
Assembler Results:
  1. math.max: 18 instructions total.
  2. if (num > y) then: 18 instructions total.
  3. a and b or c: 18 instructions total.
Conclusion:
LuaJIT compiles them with the same performance.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | math.max | 0.23708 sec(s) | 0.22686 | 0.28223 | 0.23841 | 147.10%
2 | if (num > y) then | 0.16116 sec(s) | 0.15814 | 0.169 | 0.16173 | 100%
3 | a and b or c | 0.18716 sec(s) | 0.18291 | 0.19907 | 0.18795 | 116.13%
Conclusion:
math.max has a function call overhead, which probably makes it slower.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | math.max | 0.647 sec(s) | 0.621 | 0.731 | 0.65725 | 134.93%
2 | if (num > y) then | 0.4795 sec(s) | 0.464 | 0.56 | 0.48323 | 100%
3 | a and b or c | 0.528 sec(s) | 0.501 | 0.726 | 0.54099 | 110.11%
Conclusion:
math.max has a function call overhead, which probably makes it slower.
6. "not a" vs "a or b"🔗︎
Predefines:
local y
Code 1:
if not y then local x = 1 else local x = y end
Code 2:
local x = y or 1
Results (10M iterations):
Assembler Results:
  1. if: 24 instructions total.
  2. a or b: 24 instructions total.
Conclusion:
LuaJIT compiles them with the same performance.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | if | 0.13572 sec(s) | 0.13033 | 0.19183 | 0.13831 | 100%
2 | a or b | 0.13608 sec(s) | 0.12298 | 0.23668 | 0.13989 | 100.26%
Conclusion:
a or b should be faster due to the unary test-and-copy instructions ISTC and ISFC.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | if | 0.398 sec(s) | 0.382 | 0.484 | 0.40146 | 112.11%
2 | a or b | 0.355 sec(s) | 0.349 | 0.367 | 0.35508 | 100%
Conclusion:
a or b should be faster due to the TESTSET instruction.
7. "x ^ 2" vs "x * x" vs "math.pow"🔗︎
Predefines:
local x = 10
local pow = math.pow
Code 1:
local y = x ^ 2
Code 2:
local y = x * x
Code 3:
local y = pow(x, 2)
Results (10M iterations):
Assembler Results:
  1. x ^ 2: 18 instructions total.
  2. x * x: 18 instructions total.
  3. math.pow: 18 instructions total.
Conclusion:
LuaJIT compiles them with the same performance.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | x ^ 2 | 0.60192 sec(s) | 0.5671 | 0.85234 | 0.61451 | 442.13% (4 times slower)
2 | x * x | 0.13614 sec(s) | 0.13237 | 0.19182 | 0.13845 | 100%
3 | math.pow | 0.69741 sec(s) | 0.59753 | 0.95067 | 0.70204 | 512.26% (5 times slower)
Conclusion:
Use multiplication instead of the power operator if you know the exact exponent. math.pow has a function call overhead.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | x ^ 2 | 1.044 sec(s) | 1.023 | 1.242 | 1.05641 | 269.07% (2 times slower)
2 | x * x | 0.388 sec(s) | 0.376 | 0.456 | 0.39083 | 100%
3 | math.pow | 1.3125 sec(s) | 1.211 | 1.529 | 1.32576 | 338.27% (3 times slower)
Conclusion:
Use multiplication instead of the power operator if you know the exact exponent. math.pow has a function call overhead.
8. "math.fmod" vs "%" operator🔗︎
Predefines:
local fmod = math.fmod
local function jit_fmod(a, b)
    if b < 0 then b = -b end
    if a < 0 then
        return -(-a % b)
    else
        return a % b
    end
end
Code 1:
local x = fmod(times, 30)
Code 2:
local x = (times % 30)
Code 3:
local x = jit_fmod(times, 30)
Results (10M iterations):
Assembler Results:
  1. fmod: 55 instructions total. NYI on LuaJIT 2.0; stitches on LuaJIT 2.1.
  2. %: 18 instructions total.
  3. JITed fmod: 20 instructions total.
    mov dword [0x2add0410], 0x4
    movsd xmm7, [rdx+0x48]
    cvttsd2si eax, xmm7
    xorps xmm6, xmm6
    cvtsi2sd xmm6, eax
    ucomisd xmm7, xmm6
    jnz 0x7ffa543c0010 ->0
    jpe 0x7ffa543c0010 ->0
    cmp eax, 0x7ffffffe
    jg 0x7ffa543c0010 ->0
    cvttsd2si edi, [rdx+0x40]
    cmp dword [rdx+0x24], -0x09
    jnz 0x7ffa543c0010 ->0
    cmp dword [rdx+0x20], 0x2adf2c00
    jnz 0x7ffa543c0010 ->0
    + test edi, edi
    + jl 0x7ffa543c0014 ->1
    add edi, +0x01
    cmp edi, eax
    jg 0x7ffa543c0018 ->2
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | fmod | 0.2961 sec(s) | 0.2885 | 0.4984 | 0.30364 | 7670.98% (76 times slower)
2 | % and JITed fmod | 0.00386 sec(s) | 0.00305 | 0.00643 | 0.00401 | 100%
Conclusion:
Use % for positive modulo. For negative or mixed operands use the JITed fmod.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | fmod | 0.33687 sec(s) | 0.3147 | 0.42487 | 0.34024 | 239.59% (2 times slower)
2 | % | 0.1406 sec(s) | 0.13166 | 0.21037 | 0.14226 | 100%
3 | JITed fmod | 0.35584 sec(s) | 0.34378 | 0.52199 | 0.36319 | 253.07% (2 times slower)
Conclusion:
The JITed fmod solves the compilation problem, but it's slower in interpreter mode.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | fmod | 0.7055 sec(s) | 0.657 | 0.842 | 0.7135 | 182.77%
2 | % | 0.386 sec(s) | 0.374 | 0.471 | 0.39029 | 100%
3 | JITed fmod | 0.858 sec(s) | 0.812 | 1.127 | 0.8753 | 222.27% (2 times slower)
Conclusion:
The JITed fmod is not recommended for plain Lua. Use the modulo operator for positive numbers and math.fmod for negative and mixed operands.
9. Predefined function or anonymous function in the argument
Predefines:
local func1 = function(a, b, func)
    return func(a + b)
end
local func2 = function(a)
    return a * 2
end
Code 1:
local x = func1(1, 2, function(a) return a * 2 end)
Code 2:
local x = func1(1, 2, func2)
Results (10M iterations):
Assembler Results:
  1. Function in argument: NYI on LuaJIT 2.1
  2. Localized function: 18 instructions total.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Function in argument | 0.61696 sec(s) | 0.57184 | 0.81876 | 0.62018 | 18791.11% (187 times slower)
2 | Localized function | 0.00328 sec(s) | 0.00307 | 0.00733 | 0.00354 | 100%
Conclusion:
If possible, localize your function and re-use it. If you need to provide a local to the closure, try a different approach to passing values. A simple example is changing a stateful iterator to a stateless one (see the sketch below). Example of different value passing:
function func()
    local a, b = 50, 10
    timer.Simple(5, function() print(a + b) end)
end
In this example timer.Simple can't pass arguments to the function, so we change the style of value passing from function upvalues to main-chunk upvalues:
local Ua, Ub
local function printAplusB()
    print(Ua + Ub)
end
function func()
    local a, b = 50, 10
    Ua, Ub = a, b
    timer.Simple(5, printAplusB)
end
Moving the function outside allows func to be compiled.
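As for the stateful-to-stateless iterator change mentioned above, here is a minimal sketch (the values/iter names are illustrative, not part of the benchmark). The closure version carries its counter as an upvalue, so a new closure is created per traversal; the stateless version receives the table and the control variable from the generic for itself:

-- stateful: creates a new closure (with an upvalue) for every traversal
local function values_stateful(t)
    local i = 0
    return function()
        i = i + 1
        return t[i]
    end
end

-- stateless: no closure is created; the generic for passes the state back in
local function iter(t, i)
    i = i + 1
    local v = t[i]
    if v ~= nil then return i, v end
end
local function values(t)
    return iter, t, 0
end

for i, v in values({ "a", "b", "c" }) do print(i, v) end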
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Function in argument | 0.61991 sec(s) | 0.58707 | 0.88968 | 0.63932 | 166.80%
2 | Localized function | 0.37165 sec(s) | 0.33582 | 0.4346 | 0.37056 | 100%
Conclusion:
If it's possible, localize your function and re-use it. See a possible solution on LuaJIT tab.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Function in argument | 1.7375 sec(s) | 1.646 | 2.083 | 1.75102 | 185.63%
2 | Localized function | 0.936 sec(s) | 0.915 | 1.028 | 0.94642 | 100%
Conclusion:
If it's possible, localize your function and re-use it. See a possible solution on LuaJIT tab.
10. for loops
Predefines:
local a = {}
for i = 1, 100 do a[i] = i end
a.n = 100
a[0] = 100
local length = #a
local nxt = next
function jit_pairs(t)
    return nxt, t
end
Code 1:
for k, v in pairs(a) do local x = v end
Code 2 (Using JITed next on 2.1.0-beta2):
for k, v in jit_pairs(a) do local x = v end
Code 3:
for k, v in ipairs(a) do local x = v end
Code 4:
for i = 1, 100 do local x = a[i] end
Code 5:
for i = 1, #a do local x = a[i] end
Code 6:
for i = 1, length do local x = a[i] end
Code 7:
for i = 1, a.n do local x = a[i] end
Code 8:
for i = 1, a[0] do local x = a[i] end
Results (1M iterations):
May be incorrect. Awaits recalculation.
Assembler Results:
  1. pairs: NYI on LuaJIT 2.1
  2. JITed pairs: 119 instructions total.
  3. Known length: 56 instructions total.
  4. ipairs: 104 instructions total.
  5. #a: 78 instructions total.
  6. Upvalued length: 60 instructions total.
  7. a.n: 89 instructions total.
  8. a[0]: 80 instructions total.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | pairs | 0.51975 sec(s) | 0.48428 | 0.63495 | 0.52525 | 757.87% (7 times slower)
2 | JITed pairs | 0.41983 sec(s) | 0.39041 | 0.52863 | 0.41906 | 612.17% (6 times slower)
3 | ipairs | 0.12707 sec(s) | 0.12164 | 0.20861 | 0.13086 | 185.28%
4 | Known length | 0.11527 sec(s) | 0.11252 | 0.15329 | 0.1175 | 168.08%
5 | #a | 0.12063 sec(s) | 0.10235 | 0.17138 | 0.1199 | 175.89%
6 | Upvalued length | 0.08333 sec(s) | 0.07875 | 0.17807 | 0.08744 | 121.50%
7 | a.n | 0.08724 sec(s) | 0.08448 | 0.10026 | 0.08813 | 127.20%
8 | a[0] | 0.06858 sec(s) | 0.06673 | 0.09873 | 0.07049 | 100%
Conclusion:
a[0] or a.n is the best solution you can use. (If you have table.pack, remember that it creates a sequence table and sets the field n to the size of the created table; this can be used for iteration, as sketched below.) JITed pairs is still slow, but it will compile.
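A sketch of that table.pack pattern (table.pack is standard since Lua 5.2; on LuaJIT its availability depends on the 5.2 compatibility option, so treat it as an assumption):

local t = table.pack("a", "b", "c") -- t = { "a", "b", "c", n = 3 }
for i = 1, t.n do
    local x = t[i]
end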
The results of this test for the LuaJIT interpreter are confusing. They were verified many times. The current goal is to email Mike Pall about these results and ask why they are so different.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | pairs | 0.51711 sec(s) | 0.48241 | 0.67666 | 0.5224 | 100%
2 | JITed pairs | 1.80467 sec(s) | 1.62461 | 2.02158 | 1.77821 | 348.99% (3 times slower)
3 | ipairs | 1.70326 sec(s) | 1.64163 | 2.1924 | 1.72125 | 329.38% (3 times slower)
4 | Known length | 0.67382 sec(s) | 0.6603 | 0.85948 | 0.68079 | 130.30%
5 | #a | 0.6967 sec(s) | 0.68416 | 0.74215 | 0.70065 | 134.72%
6 | Upvalued length | 0.67209 sec(s) | 0.6611 | 0.77354 | 0.67794 | 129.97%
7 | a.n | 0.69201 sec(s) | 0.66747 | 1.00413 | 0.7115 | 133.82%
8 | a[0] | 0.6715 sec(s) | 0.66014 | 0.77048 | 0.67611 | 129.85%
Conclusion:
These results require an explanation; no conclusion can be made.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | pairs | 3.5325 sec(s) | 3.241 | 3.968 | 3.51657 | 193.03%
2 | ipairs | 3.226 sec(s) | 3.059 | 4.155 | 3.24595 | 176.28%
3 | Known length | 1.83 sec(s) | 1.753 | 2.005 | 1.83169 | 100%
4 | #a | 1.8305 sec(s) | 1.755 | 2.114 | 1.84612 | 100.02%
5 | Upvalued length | 1.8775 sec(s) | 1.794 | 2.452 | 1.9197 | 102.59%
6 | a.n | 1.8815 sec(s) | 1.773 | 2.248 | 1.89361 | 102.81%
7 | a[0] | 1.841 sec(s) | 1.779 | 2.197 | 1.86906 | 100.60%
Conclusion:
a[0] and a.n are as fast as in compiled LuaJIT.
11. Localizing table value for multiple usage
Predefines:
local a = {}
for i = 1, 100 do a[i] = { x = 10 } end
Code 1:
for n = 1, 100 do a[n].x = a[n].x + 1 end
Code 2:
local a = a
for n = 1, 100 do
    local y = a[n]
    y.x = y.x + 1
end
Results (10M iterations):
Assembler Results:
  1. No localization: 64 instructions total.
  2. Localized a and a[n]: 64 instructions total.
    mov dword [0x29550410], 0x4
    mov edx, [0x295504b4]
    movsd xmm15, [0x295807c0]
    movsd xmm14, [0x29580790]
    movsd xmm6, [0x29580780]
    cmp dword [rdx-0x4], 0x2955bb3c
    jnz 0x7ffa5abd0014 ->1
    add edx, -0x60
    mov [0x295504b4], edx
    movsd xmm13, [rdx+0x40]
    movsd xmm7, [rdx+0x38]
    addsd xmm7, xmm15
    ucomisd xmm13, xmm7
    jb 0x7ffa5abd0018 ->2
    cmp dword [rdx+0x14], -0x09
    + jnz 0x7ffa5abd001c ->3
    + cmp dword [rdx+0x18], 0x2959ab30
    + jnz 0x7ffa5abd001c ->3
    + mov edi, [0x2959ab20]
    + add edi, -0x08
    + cmp edi, edx
    jnz 0x7ffa5abd001c ->3
    cmp dword [rdx+0x10], 0x2959aaf0
    jnz 0x7ffa5abd001c ->3
    + mov edi, [rdx+0x8]
    + movsd [rdx+0x88], xmm15
    movsd [rdx+0x80], xmm15
    movsd [rdx+0x78], xmm15
    movsd [rdx+0x70], xmm14
    - movsd [rdx+0x68], xmm15
    + mov dword [rdx+0x6c], 0xfffffff4
    + mov [rdx+0x68], edi
    movsd [rdx+0x60], xmm6
    mov dword [rdx+0x5c], 0x2955bb3c
    mov dword [rdx+0x58], 0x2959aaf0
    movsd [rdx+0x38], xmm7
    add edx, +0x60
    mov [0x295504b4], edx
    jmp 0x7ffa5abdfdc1
    mov dword [0x29550410], 0x2
    movsd xmm0, [0x29592110]
    cvttsd2si edi, [rdx+0x8]
    - mov r10d, [rdx-0x8]
    - mov esi, [r10+0x14]
    - mov r9d, [rsi+0x10]
    - mov ebx, r9d
    - sub ebx, edx
    - cmp ebx, +0x30
    - jbe 0x7ffa5abd0010 ->0
    cmp dword [r9+0x4], -0x0c
    jnz 0x7ffa5abd0010 ->0
    mov r8d, [r9]
    cmp dword [r8+0x18], +0x64
    jbe 0x7ffa5abd0010 ->0
    mov eax, [r8+0x8]
    cmp dword [rax+rdi*8+0x4], -0x0c
    jnz 0x7ffa5abd0010 ->0
    mov edx, [rax+rdi*8]
    - cmp ebx, +0x38
    - jbe 0x7ffa5abd0010 ->0
    cmp dword [rdx+0x1c], +0x01
    jnz 0x7ffa5abd0010 ->0
    mov ecx, [rdx+0x14]
    mov rsi, 0xfffffffb2955a520
    cmp rsi, [rcx+0x20]
    jnz 0x7ffa5abd0010 ->0
    cmp dword [rcx+0x1c], 0xfffeffff
    jnb 0x7ffa5abd0010 ->0
    movsd xmm1, [rcx+0x18]
    addsd xmm1, xmm0
    movsd [rcx+0x18], xmm1
    add edi, +0x01
    cmp edi, +0x64
    jg 0x7ffa5abd0014 ->1
Conclusion:
You may localize your values for the interpreter. However, LuaJIT suggests not trying to second-guess the JIT compiler: in compiled code, locals and upvalues are used directly by their reference pointers, so over-localization may complicate the compiled code.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | No localization | 19.37724 sec(s) | 19.18286 | 21.82537 | 19.48628 | 137.26%
2 | Localized a and a[n] | 14.11709 sec(s) | 13.8045 | 17.69585 | 14.28717 | 100%
Conclusion:
If your code can't be compiled, localization is the best you can do here.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | No localization | 58.832 sec(s) | 56.511 | 70.485 | 60.3759 | 120.64%
2 | Localized a and a[n] | 48.7655 sec(s) | 41.021 | 53.683 | 46.57903 | 100%
Conclusion:
Localization speeds up the code.
12. Array insertion
Predefines:
local a = { [0] = 0, n = 0 }
local tinsert = table.insert
local count = 1
-- Note: after each run of the code, the table and the count variable are restored to the predefined state.
-- If you don't clean them up after a test, table.insert will be super slow.
Code 1:
tinsert(a, times)
Code 2:
a[times] = times
Code 3:
a[#a + 1] = times
Code 4:
a[count] = times count = count + 1
Code 5:
a.n = a.n + 1 a[a.n] = times
Code 6:
a[0] = a[0] + 1 a[a[0]] = times
Results (1M iterations):
Assembler Results:
  1. tinsert: 65 instructions total.
  2. a[times]: 62 instructions total.
  3. a[#a + 1]: 72 instructions total.
  4. a[count]: 78 instructions total.
  5. a[a.n]: ~52
  6. a[a[0]]: ~51
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | tinsert and a[#a + 1] | 0.09972 sec(s) | 0.09614 | 0.16774 | 0.10205 | 1673.15% (16 times slower)
2 | a[times] | 0.00596 sec(s) | 0.00507 | 0.01528 | 0.00629 | 100%
3 | a[count] | 0.00655 sec(s) | 0.00599 | 0.00806 | 0.00657 | 109.89%
4 | a[a.n] | 0.00689 sec(s) | 0.006 | 0.00865 | 0.00696 | 115.6%
5 | a[a[0]] | 0.00833 sec(s) | 0.00751 | 0.01167 | 0.00844 | 139.76%
Conclusion:
Using a local or a constant value is the fastest method. If that's not possible, use an external counter; otherwise use a.n = a.n + 1; a[a.n] = times or #a + 1. The instruction counts may be inaccurate due to my limited knowledge of assembler.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | tinsert | 0.1522 sec(s) | 0.14448 | 0.21487 | 0.15571 | 112.44%
2 | a[times] | 0.01899 sec(s) | 0.01791 | 0.03054 | 0.0194 | 14.03% (7 times faster)
3 | a[#a + 1] | 0.13535 sec(s) | 0.12965 | 0.17014 | 0.13644 | 100%
4 | a[count] | 0.0277 sec(s) | 0.02617 | 0.03003 | 0.02779 | 20.46% (4 times faster)
5 | a[a.n] | 0.0368 sec(s) | 0.03462 | 0.057 | 0.03752 | 27.18% (3 times faster)
6 | a[a[0]] | 0.0335 sec(s) | 0.03114 | 0.04102 | 0.03386 | 24.75% (4 times faster)
Conclusion:
Please note that the percentage calculation here is relative to a different baseline row. Using a local or a constant value is the fastest method. If that's not possible, use an external counter; otherwise use a.n = a.n + 1; a[a.n] = times or #a + 1.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | tinsert | 0.134 sec(s) | 0.128 | 0.165 | 0.13653 | 103.07%
2 | a[times] | 0.06 sec(s) | 0.057 | 0.066 | 0.06042 | 46.15% (2 times faster)
3 | a[#a + 1] | 0.13 sec(s) | 0.125 | 0.162 | 0.13142 | 100%
4 | a[count] | 0.075 sec(s) | 0.069 | 0.108 | 0.07713 | 57.69%
5 | a[a.n] | 0.188 sec(s) | 0.179 | 0.245 | 0.19067 | 144.61%
6 | a[a[0]] | 0.255 sec(s) | 0.246 | 0.292 | 0.25796 | 196.15%
Conclusion:
Please note that the percentage calculation here is relative to a different baseline row. Using a local or a constant value is the fastest method. If that's not possible, use an external counter; otherwise use a.n = a.n + 1; a[a.n] = times or #a + 1.
13. Table with and without pre-allocated size
Predefines:
local a
require("table.new")
local ffi = require("ffi")
local new = table.new
local ffinew = ffi.new
Code 1:
local a = {} a[1] = 1 a[2] = 2 a[3] = 3
Code 2:
local a = {true, true, true} a[1] = 1 a[2] = 2 a[3] = 3
Code 3 (table.new is available since LuaJIT v2.1.0-beta1):
local a = new(3,0) a[1] = 1 a[2] = 2 a[3] = 3
Code 4:
local a = {1, 2, 3}
Code 5 (FFI):
local a = ffinew("int[3]", 1, 2, 3)
Code 6 (FFI):
local a = ffinew("int[3]") a[0] = 1 a[1] = 2 a[2] = 3
Results (10M iterations):
Assembler Results:
  1. Allocated on demand: 96 instructions total.
  2. Pre-allocated with dummy values: 18 instructions total.
  3. Pre-allocated by table.new: 82 instructions total.
  4. Defined in constructor: 18 instructions total.
  5. (FFI) Defined in constructor: 18 instructions total.
  6. (FFI) Defined after: 18 instructions total.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Allocated on demand | 1.26337 sec(s) | 1.24794 | 1.53786 | 1.2751 | 39480.31% (394 times slower)
2 | Pre-allocated with dummy values | 0.0032 sec(s) | 0.00312 | 0.00358 | 0.00322 | 100%
3 | Pre-allocated by table.new | 0.41859 sec(s) | 0.4055 | 0.49486 | 0.42476 | 13080.93% (130 times slower)
4 | Defined in constructor | 0.00325 sec(s) | 0.00306 | 0.00411 | 0.00329 | 101.56%
5 | (FFI) Defined in constructor | 0.00325 sec(s) | 0.0031 | 0.00425 | 0.00331 | 101.56%
6 | (FFI) Defined after | 0.00339 sec(s) | 0.00312 | 0.00463 | 0.00351 | 105.93%
Conclusion:
Pre-allocation will speed up your code if you need more speed. In half of the cases tables are used without pre-allocated space, so it's OK to allocate them on demand.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Allocated on demand | 1.73737 sec(s) | 1.7137 | 1.90643 | 1.74489 | 310.27% (3 times slower)
2 | Pre-allocated with dummy values | 0.61846 sec(s) | 0.61472 | 0.6396 | 0.61924 | 110.44%
3 | Pre-allocated by table.new | 0.86155 sec(s) | 0.81076 | 1.41348 | 0.86788 | 153.86%
4 | Defined in constructor | 0.55995 sec(s) | 0.53821 | 0.63602 | 0.56426 | 100%
5 | (FFI) Defined in constructor | 3.09061 sec(s) | 2.94983 | 3.91517 | 3.18377 | 551.94% (5 times slower)
6 | (FFI) Defined after | 4.46811 sec(s) | 4.18024 | 5.32326 | 4.61457 | 797.94% (7 times slower)
Conclusion:
Pre-allocation will speed up your code if you need more speed. In half of the cases tables are used without pre-allocated space, so it's OK to allocate them on demand. If you don't need an FFI array, don't use one for CPU optimization (only to save RAM).
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Allocated on demand | 5.304 sec(s) | 5.243 | 5.694 | 5.32726 | 196.88%
2 | Pre-allocated with dummy values | 2.863 sec(s) | 2.676 | 3.763 | 2.9231 | 106.27%
3 | Defined in constructor | 2.694 sec(s) | 2.303 | 3.364 | 2.65954 | 100%
Conclusion:
Pre-allocation will speed up your code if you need more speed. In half of the cases tables are used without pre-allocated space, so it's OK to allocate them on demand.
14. Table initialization before or each time on insertion
Predefines:
local T = {} local CachedTable = {"abc", "def", "ghk"}
Code 1:
T[times] = CachedTable
Code 2:
T[times] = {"abc", "def", "ghk"}
Results (10M iterations):
Assembler Results:
  1. Cached table for all insertion: ~46
  2. Table constructor for each insertion: ~50
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Cached table for all insertion | 0.00881 sec(s) | 0.00778 | 0.01554 | 0.00892 | 100%
2 | Table constructor for each insertion | 0.2196 sec(s) | 0.19785 | 2.71365 | 0.33673 | 2493.19% (24 times slower)
Conclusion:
If possible, cache your table. The instruction counts may be inaccurate due to my limited knowledge of assembler.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Cached table for all insertion | 0.18031 sec(s) | 0.16969 | 0.27324 | 0.18397 | 100%
2 | Table constructor for each insertion | 0.37549 sec(s) | 0.31935 | 2.9034 | 0.84225 | 208.24% (2 times slower)
Conclusion:
If possible, cache your table.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Cached table for all insertion | 0.485 sec(s) | 0.449 | 0.624 | 0.49238 | 100%
2 | Table constructor for each insertion | 2.349 sec(s) | 2.23 | 3.461 | 2.42184 | 484.32% (4 times slower)
Conclusion:
If possible, cache your table.
15. String split (by character)
Predefines:
local text = "Hello, this is an example text" local cstring = ffi.cast("const char*", text) local char = string.char local sub, gsub, gmatch = string.sub, string.gsub, string.gmatch local gsubfunc = function(s) local x = s end
Code 1:
for i = 1, #text do local x = sub(text, i, i) end
Code 2:
for k in gmatch(text, ".") do local x = k end
Code 3:
gsub(text, ".", gsubfunc)
Code 4 (FFI):
for i = 0, #text - 1 do local x = char(cstring[i]) end
Results (10M iterations):
Assembler Results:
  1. sub(i,i): 49 instructions total.
  2. gmatch: 114 instructions total. NYI on LuaJIT 2.0; stitches on LuaJIT 2.1.
  3. gsub: 65 instructions total. NYI on LuaJIT 2.0; stitches on LuaJIT 2.1.
  4. (FFI) const char indexing: 48 instructions total. NYI on LuaJIT 2.0
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | sub | 0.03063 sec(s) | 0.0257 | 0.06535 | 0.03253 | 114.12%
2 | gmatch | 1.66512 sec(s) | 1.6147 | 2.3234 | 1.75248 | 6203.87% (62 times slower)
3 | gsub | 2.28969 sec(s) | 2.21768 | 2.77874 | 2.32719 | 8530.88% (85 times slower)
4 | (FFI) const char indexing | 0.02684 sec(s) | 0.02552 | 0.03212 | 0.02705 | 100%
Conclusion:
If you're using FFI on LuaJIT 2.1.0 or higher, this splitting is the fastest. You probably won't need to split at all, because FFI arrays are mutable, so all text manipulation can be done in place. Otherwise use string.sub. It's recommended to use string.find, string.match, etc. where possible: splitting into single characters wastes GC time.
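For delimiter-based splitting, a plain string.find loop avoids creating per-character garbage; a minimal sketch (the split helper is ours, not part of the benchmark):

local find, sub = string.find, string.sub

local function split(s, sep)
    local out, i, pos = {}, 1, 1
    while true do
        local a, b = find(s, sep, pos, true) -- plain find, no patterns
        if not a then
            out[i] = sub(s, pos)
            return out
        end
        out[i] = sub(s, pos, a - 1)
        i, pos = i + 1, b + 1
    end
end

-- split("Hello, this is an example text", " ")[1] --> "Hello,"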
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | sub | 0.70481 sec(s) | 0.68331 | 1.25768 | 0.75732 | 100%
2 | gmatch | 1.4904 sec(s) | 1.44831 | 2.05846 | 1.53365 | 211.46% (2 times slower)
3 | gsub | 2.12422 sec(s) | 2.07281 | 2.61494 | 2.16115 | 301.38% (3 times slower)
4 | (FFI) const char indexing | 2.31658 sec(s) | 2.18599 | 3.02638 | 2.35951 | 328.68% (3 times slower)
Conclusion:
Use string.sub. It's recommended to use string.find, string.match, etc. where possible: splitting into single characters wastes GC time.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | sub | 1.6025 sec(s) | 1.558 | 2.258 | 1.68562 | 100%
2 | gmatch | 2.157 sec(s) | 2.092 | 2.394 | 2.16154 | 134.60%
3 | gsub | 2.6765 sec(s) | 2.273 | 3.131 | 2.57897 | 167.02%
Conclusion:
Use string.sub. It's recommended to use string.find, string.match, etc. where possible: splitting into single characters wastes GC time.
16. Empty string check
Predefines:
local s = "" local cstring = ffi.cast("const char*", s) ffi.cdef([[ size_t strlen ( const char * str ); ]]) local C = ffi.C
Code 1:
local x = #s == 0
Code 2:
local x = s == ""
Code 3 (FFI):
local x = cstring[0] == 0
Code 4 (FFI):
local x = C.strlen(cstring) == 0
Results (10M iterations):
Assembler Results:
  1. #s == 0: 18 instructions total.
  2. s == "": 18 instructions total.
  3. cstring[0] == 0: 21 instructions total.
  4. C.strlen(cstring) == 0: 59 instructions total.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | #s == 0 and s == "" | 0.00328 sec(s) | 0.00308 | 0.01392 | 0.00364 | 100%
2 | cstring[0] == 0 | 0.00362 sec(s) | 0.00307 | 0.00699 | 0.00391 | 110.36%
3 | C.strlen(cstring) == 0 | 0.02658 sec(s) | 0.02398 | 0.04405 | 0.02779 | 810.36% (8 times slower)
Conclusion:
Even if you're using FFI, use plain Lua syntax to check for an empty string.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | #s == 0 | 0.17336 sec(s) | 0.16405 | 0.27169 | 0.18419 | 125.75%
2 | s == "" | 0.13785 sec(s) | 0.13267 | 0.18884 | 0.1399 | 100%
3 | cstring[0] == 0 | 0.66383 sec(s) | 0.64888 | 0.7367 | 0.66915 | 481.55% (4 times slower)
4 | C.strlen(cstring) == 0 | 2.19199 sec(s) | 2.1318 | 2.52241 | 2.19931 | 1590.12% (15 times slower)
Conclusion:
Even if you're using FFI, use plain Lua syntax to check for an empty string.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | #s == 0 | 0.4685 sec(s) | 0.456 | 0.649 | 0.48129 | 107.82%
2 | s == "" | 0.4345 sec(s) | 0.412 | 0.545 | 0.44393 | 100%
Conclusion:
String comparison is a little bit faster than length comparison.
17. C array size (FFI)
Predefines:
local ffi = require("ffi")
new = ffi.new
Code 1:
new("const char*[16]") new("const char*[1024]") new("int[16]") new("int[1024]")
Code 2:
new("const char*[?]", 16) new("const char*[?]", 1024) new("int[?]", 16) new("int[?]", 1024)
Results (1M iterations):
Assembler Results:
  1. [n]: 113 instructions total. NYI on LuaJIT 2.0
  2. VLA: 105 instructions total. NYI on LuaJIT 2.0
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | [n] | 2.64742 sec(s) | 2.20694 | 4.07516 | 2.68361 | 105.73%
2 | VLA | 2.50381 sec(s) | 2.01546 | 3.85597 | 2.47497 | 100%
Conclusion:
For some reason LuaJIT 2.0 is not able to compile allocation of any C type. Use a VLA if possible.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | [n] | 4.32618 sec(s) | 3.7442 | 5.66979 | 4.37286 | 102.28%
2 | VLA | 4.22957 sec(s) | 3.54316 | 5.77651 | 4.20961 | 100%
Conclusion:
Use VLA if possible.
18. String concatenation
Predefines:
local bs = string.rep("----------", 1000)
local t = {bs, bs, bs, bs, bs, bs, bs, bs, bs, bs}
local concat = table.concat
local format = string.format
Code 1:
local s = bs .. bs .. bs .. bs .. bs .. bs .. bs .. bs .. bs .. bs
Code 2:
local s = bs
s = s .. bs
s = s .. bs
s = s .. bs
s = s .. bs
s = s .. bs
s = s .. bs
s = s .. bs
s = s .. bs
s = s .. bs
Code 3:
local s = bs for i = 1, 9 do s = s .. bs end
Code 4:
concat(t)
Code 5:
format("%s%s%s%s%s%s%s%s%s%s", bs, bs, bs, bs, bs, bs, bs, bs, bs, bs)
Results (100k iterations):
Assembler Results:
  1. Inline concat: 18 instructions total. NYI on LuaJIT 2.0
  2. Separate concat: 18 instructions total. NYI on LuaJIT 2.0
  3. Loop concat: 94 instructions total. NYI on LuaJIT 2.0
  4. table.concat: 39 instructions total. NYI on LuaJIT 2.0
  5. string.format: 18 instructions total. NYI on LuaJIT 2.0
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Inline, separate concat and string.format | 0.00003 sec(s) | 0.00003 | 0.00415 | 0.00012 | 0.009986% (10014 times faster)
2 | Loop concat | 6.70725 sec(s) | 5.60963 | 8.0101 | 6.57035 | 2232.55% (22 times slower)
3 | table.concat | 0.30043 sec(s) | 0.26492 | 0.37815 | 0.30172 | 100%
Conclusion:
Please note that the percentage calculation here is relative to a different baseline row. This is an example where LuaJIT fails to optimize and compile code efficiently: the loop wasn't unrolled properly. LuaJIT suggests finding a balance between loops and unrolling and using templates. table.concat is the best solution in complicated code; however, if possible, make your concats inline or unroll your loops.
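A minimal sketch of that table.concat buffer pattern (the buf name is illustrative; bs comes from the predefines): collect the pieces into a table and concatenate once instead of growing the string step by step:

local buf = {}
for i = 1, 10 do
    buf[i] = bs
end
local s = table.concat(buf)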
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Inline concat | 1.44256 sec(s) | 1.42674 | 1.76183 | 1.46447 | 100%
2 | Separate concat | 5.82289 sec(s) | 5.44671 | 7.76331 | 5.9645 | 403.64% (4 times slower)
3 | Loop concat | 6.61971 sec(s) | 5.70944 | 7.64707 | 6.6218 | 458.88% (4 times slower)
4 | table.concat | 1.49022 sec(s) | 1.41849 | 1.95012 | 1.56112 | 103.30%
5 | string.format | 1.46481 sec(s) | 1.42773 | 2.05097 | 1.52796 | 101.54%
Conclusion:
If possible, inline your concats; otherwise use table.concat.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | Inline concat | 1.023 sec(s) | 1.01 | 1.296 | 1.04552 | 100%
2 | Separate concat | 10.445 sec(s) | 9.918 | 12.909 | 10.63149 | 1021.01% (10 times slower)
3 | Loop concat | 11.723 sec(s) | 9.919 | 14.472 | 11.64345 | 1145.94% (11 times slower)
4 | table.concat | 2.151 sec(s) | 2.083 | 2.378 | 2.16366 | 210.26% (2 times slower)
5 | string.format | 2.179 sec(s) | 2.116 | 3.099 | 2.26572 | 213% (2 times slower)
Conclusion:
If possible, inline your concats; otherwise use table.concat.
19. String in a function
Predefines:
local TYPE_bool = "bool"
local type = type
local function isbool1(b)
    return type(b) == "bool"
end
local function isbool2(b)
    return type(b) == TYPE_bool
end
Code 1:
isbool1(false)
Code 2:
isbool2(false)
Results (10M iterations):
Assembler Results:
  1. KGC string: 18 instructions total.
  2. Upvalued string: 18 instructions total.
Conclusion:
LuaJIT compiles them with the same performance.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | KGC string | 0.39173 sec(s) | 0.37698 | 0.63159 | 0.41579 | 100%
2 | Upvalued string | 0.40781 sec(s) | 0.3934 | 0.51813 | 0.4151 | 104.10%
Conclusion:
If possible use literal strings in the function.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | KGC string | 1.324 sec(s) | 1.26 | 1.99 | 1.37005 | 100%
2 | Upvalued string | 1.3915 sec(s) | 1.268 | 1.773 | 1.40522 | 105.09%
Conclusion:
If possible use literal strings in the function.
20. Taking a value from a function with multiple returns
Predefines:
local function funcmret()
    return 1, 2
end
local select = select
Code 1:
local _, arg2 = funcmret() return arg2
Code 2:
local arg2 = select(2, funcmret()) return arg2
Results (10M iterations):
Assembler Results:
  1. With dummy variables: 18 instructions total.
  2. select: 18 instructions total.
Conclusion:
LuaJIT compiles them with the same performance.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | With dummy variables | 0.25193 sec(s) | 0.24568 | 0.27575 | 0.25267 | 100%
2 | select | 0.38455 sec(s) | 0.37498 | 0.4397 | 0.38579 | 152.63%
Conclusion:
select makes no sense for functions with fewer than (at least) 10 returned values: all returned values are pushed onto the stack anyway, and any value you choose can be picked up individually. Tip: if you need only the first return value, wrap the function call in parentheses.
print( (math.frexp(0)) )
This will print only the first value.
Benchmark Results:
# | Name | Median | Minimum | Maximum | Average | Percentage
1 | With dummy variables | 0.611 sec(s) | 0.6 | 0.702 | 0.61562 | 100%
2 | select | 0.813 sec(s) | 0.786 | 0.926 | 0.81984 | 133.06%
Conclusion:
select makes no sense for functions with fewer than (at least) 10 returned values: all returned values are pushed onto the stack anyway, and any value you choose can be picked up individually. Tip: if you need only the first return value, wrap the function call in parentheses.
print( (math.frexp(0)) )
This will print only the first value.
Made by Spar (Spar#6665). New benchmark tests are welcome. Public Domain 2020