These benchmark tests demonstrate the performance of the LuaJIT compiler, the LuaJIT interpreter, and Lua 5.1.
LuaJIT states that globals and locals now have the same performance, unlike in plain Lua.
LuaJIT claims to be faster than Lua; even the Lua project suggests using LuaJIT for more performance.
LuaJIT uses its own interpreter, its own compiler, and many other optimizations to improve performance. But is it really fast?
About tests
This site contains results and conclusions for the LuaJIT compiler, the LuaJIT interpreter, and Lua 5.1.4.
The LuaJIT interpreter is included because its numbers are useful for functions you are 100% sure will never compile.
Or maybe you're using an embedded LuaJIT 2.0 that aborts a trace on any C function (and FFI is disabled).
Lua 5.1 is included to help you decide what to choose, or just out of curiosity.
The first 14 benchmark tests were taken from this page. New benchmark tests are welcome.
Specs: Intel i5-6500 3.20 GHz. 64-bit. LuaJIT 2.1.0-beta3. (Lua 5.1.4 for plain Lua tests) (LuaJIT 2.0.4 for LuaJIT 2.0 assembler tests)
(JIT: ON SSE2 SSE3 SSE4.1 BMI2 fold cse dce fwd dse narrow loop abc sink fuse)
Benchmark Code
Source code
For benchmark tests we use the median of 100 takes of the given number of iterations of the code.
local takes = {}
for take = 1, 100 do
local START = os.clock()
for times = 1, iterations do
...
end
local END = os.clock()
takes[take] = END - START
end
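The median of a list of timing samples can be computed with a small helper; a minimal sketch (the `median` name and the sample values are mine, not from the benchmark source):

```lua
-- Hypothetical helper: median of a list of timing samples.
local function median(t)
  local s = {}
  for i = 1, #t do s[i] = t[i] end  -- copy so the input stays unsorted
  table.sort(s)
  local n = #s
  if n % 2 == 1 then
    return s[(n + 1) / 2]
  else
    return (s[n / 2] + s[n / 2 + 1]) / 2
  end
end

print(median({3, 1, 2}))     -- 2
print(median({4, 1, 3, 2}))  -- 2.5
```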
For assembler tests we use luajit -jdump=+Arsa asmbench.lua.
The total number of instructions is based on the maximum possible count (the last jump or RET).
Bytecode size is taken from -jdump, not -bl, so it also counts sub-functions' instructions and headers.
Script for bytecode test.
Each global lookup can cost around 11 instructions. Both versions run at almost the same speed, but this benchmark tests only one global.
It is still good practice to localize the variables you need.
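A sketch of what "localize" means here (hypothetical loop, not the benchmark code itself):

```lua
-- Cache the global lookup once, outside the hot loop.
local floor = math.floor  -- one global + one table lookup, paid once
local sum = 0
for i = 1, 1000 do
  sum = sum + floor(i / 2)  -- plain local access on every iteration
end
print(sum)  -- 250000
```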
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Global
0.24571 sec(s)
0.23929
0.29617
0.24856
(102.83%)
2
Local
0.23894 sec(s)
0.22918
0.32741
0.24434
(100%)
Conclusion:
The JIT compiler gives globals and locals the same performance. It is still good practice to localize the variables you need; upvalues are faster still.
As the first test showed, each table indexing can cost around 11 additional instructions. Localizing the math table itself won't help much; localize the exact values you need.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Global table indexing
0.36948 sec(s)
0.36357
0.3908
0.37024
(146.27%)
2
Local
0.25259 sec(s)
0.24956
0.26611
0.25344
(100%)
Conclusion:
Localizing the exact value will get you more performance.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Global table indexing
0.9335 sec(s)
0.884
1.039
0.93893
(120.84%)
2
Local
0.7725 sec(s)
0.733
0.889
0.77743
(100%)
Conclusion:
Localizing the exact value will get you more performance.
LuaJIT compiles both variants with the same performance.
However, LuaJIT's advice is not to second-guess the JIT compiler, because unnecessary localization can produce more complicated code.
Localizing local c = a + b for z = x[a+b] + y[a+b] is redundant; the JIT compiles code such as a[i][j] = a[i][j] * a[i][j+1] perfectly well.
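A sketch of that point (values are mine): both forms compute the same result, and the JIT's common-subexpression elimination already shares the repeated a + b, so the manual local buys nothing in compiled code.

```lua
local x, y = {10, 20}, {30, 40}
local a, b = 1, 1

-- manual localization of the repeated index
local c = a + b
local z1 = x[c] + y[c]

-- direct form; the compiler folds the duplicate a + b by itself
local z2 = x[a + b] + y[a + b]

print(z1, z2)  -- 60  60
```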
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Direct call
0.5611 sec(s)
0.55
0.6099
0.56474
(120.64%)
2
Localized call
0.46508 sec(s)
0.4563
0.5834
0.46996
(100%)
Conclusion:
Unlike the JIT compiler, the LuaJIT interpreter still runs faster with localized functions thanks to a simple MOV instruction.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Direct call
1.6065 sec(s)
1.516
1.843
1.62501
(120.38%)
2
Localized call
1.3345 sec(s)
1.297
1.647
1.35458
(100%)
Conclusion:
A localized function speeds up the code thanks to a simple MOV instruction.
Avoid using unpack for small tables with a known size. As an alternative, you can use this function:
do
local concat = table.concat
local loadstring = loadstring
function createunpack(n)
local ret = {"local t = ... return "}
for k = 1, n do
ret[2 + (k-1) * 4] = "t["
ret[3 + (k-1) * 4] = k
ret[4 + (k-1) * 4] = "]"
if k ~= n then ret[5 + (k-1) * 4] = "," end
end
return loadstring(concat(ret))
end
end
This function has one limitation: the maximum number of returned values is 248, while the limit of LuaJIT's own unpack is 7999 with default settings.
On the upside, createunpack can create a JIT-compiled unpack (unpack4 is simply createunpack(4)).
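A standalone usage sketch of the generated-unpack idea (a self-contained copy of the generator, so it runs on its own; loadstring is the Lua 5.1/LuaJIT name, load on 5.2+):

```lua
-- Builds "local t = ... return t[1],...,t[n]" and compiles it.
local function createunpack(n)
  local ret = {"local t = ... return "}
  for k = 1, n do
    local base = 2 + (k - 1) * 4
    ret[base], ret[base + 1], ret[base + 2] = "t[", k, "]"
    if k ~= n then ret[base + 3] = "," end
  end
  return loadstring(table.concat(ret))
end

local unpack4 = createunpack(4)
print(unpack4({10, 20, 30, 40}))  -- prints 10, 20, 30, 40 (tab-separated)
```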
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Indexing
3.73678 sec(s)
3.60006
4.61773
3.78408
(100%)
2
unpack
5.56231 sec(s)
5.12473
6.89518
5.69063
(148.85%)
3
unpack4
4.17394 sec(s)
4.12066
4.90567
4.34065
(111.69%)
Conclusion:
Avoid using unpack for small tables with a known size. As an alternative, you can use the function mentioned on the LuaJIT tab.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Indexing
12.272 sec(s)
11.929
14.207
12.38047
(115.97%)
2
unpack
10.5815 sec(s)
10
11.586
10.56572
(100%)
3
unpack4
14.855 sec(s)
14.491
18.836
15.07444
(140.38%)
Conclusion:
Any method is OK; unpack4 is the slowest, probably because of the function call overhead.
If possible, localize your function and re-use it. If you need to provide a local to the closure, try a different approach to passing values; a simple example is turning a stateful iterator into a stateless one.
Example of a different way of passing values:
function func()
local a, b = 50, 10
timer.Simple(5, function()
print(a + b)
end)
end
In this example timer.Simple can't pass arguments to the function, so we change the style of value passing from function upvalues to main-chunk upvalues:
local Ua, Ub
local function printAplusB()
print(Ua + Ub)
end
function func()
local a, b = 50, 10
Ua, Ub = a, b
timer.Simple(5, printAplusB)
end
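The stateful-to-stateless iterator idea mentioned above can be sketched like this (hypothetical iterator; it receives its state as arguments, so no closure is allocated per traversal):

```lua
-- Stateless iterator in the same shape ipairs uses: f(state, control).
local function iter(t, i)
  i = i + 1
  if t[i] ~= nil then
    return i, t[i]
  end
end

local t = {"a", "b", "c"}
for i, v in iter, t, 0 do  -- no closure created here
  print(i, v)
end
```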
a[0] or a.n is the best solution you can use. (If you have table.pack, remember that it creates a sequential table and sets n to the size of the created table; this can be used for iteration.)
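A sketch of the table.pack note (assumes table.pack is available: Lua 5.2+, or LuaJIT built with LUAJIT_ENABLE_LUA52COMPAT):

```lua
local t = table.pack("a", "b", "c")
print(t.n)  -- 3; usable directly as the loop bound
for i = 1, t.n do
  print(t[i])
end
```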
JIT-compiled pairs is still slow, but it does compile.
The results of this test for the LuaJIT interpreter are confusing.
They were verified many times. The current goal is to email Mike Pall about these results and ask why they are so different.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
pairs
0.51711 sec(s)
0.48241
0.67666
0.5224
(100%)
2
JITed pairs
1.80467 sec(s)
1.62461
2.02158
1.77821
(348.99%) (3 times slower)
3
ipairs
1.70326 sec(s)
1.64163
2.1924
1.72125
(329.38%) (3 times slower)
4
Known length
0.67382 sec(s)
0.6603
0.85948
0.68079
(130.30%)
5
#a
0.6967 sec(s)
0.68416
0.74215
0.70065
(134.72%)
6
Upvalued length
0.67209 sec(s)
0.6611
0.77354
0.67794
(129.97%)
7
a.n
0.69201 sec(s)
0.66747
1.00413
0.7115
(133.82%)
8
a[0]
0.6715 sec(s)
0.66014
0.77048
0.67611
(129.85%)
Conclusion:
These results require an explanation; no conclusion can be drawn.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
pairs
3.5325 sec(s)
3.241
3.968
3.51657
(193.03%)
2
ipairs
3.226 sec(s)
3.059
4.155
3.24595
(176.28%)
3
Known length
1.83 sec(s)
1.753
2.005
1.83169
(100%)
4
#a
1.8305 sec(s)
1.755
2.114
1.84612
(100.02%)
5
Upvalued length
1.8775 sec(s)
1.794
2.452
1.9197
(102.59%)
6
a.n
1.8815 sec(s)
1.773
2.248
1.89361
(102.81%)
7
a[0]
1.841 sec(s)
1.779
2.197
1.86906
(100.60%)
Conclusion:
a[0] and a.n are as fast as in compiled LuaJIT.
11. Localizing a table value for multiple uses
Predefines:
local a = {}
for i = 1, 100 do
a[i] = {
x = 10
}
end
Code 1:
for n = 1, 100 do
a[n].x = a[n].x + 1
end
Code 2:
local a = a
for n = 1, 100 do
local y = a[n]
y.x = y.x + 1
end
You may localize your values for the interpreter.
However, LuaJIT's advice is not to second-guess the JIT compiler: in compiled code, locals and upvalues are accessed directly through their reference pointers, so over-localization may complicate the compiled code.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
No localization
19.37724 sec(s)
19.18286
21.82537
19.48628
(137.26%)
2
Localized a and a[n]
14.11709 sec(s)
13.8045
17.69585
14.28717
(100%)
Conclusion:
If your code can't be compiled, localization is the best you can do here.
local a = {
[0] = 0,
n = 0
}
local tinsert = table.insert
local count = 1
-- Note: after each run of the code the table and count variable are restored to predefined state.
-- If you don't clean them after a test, table.insert will be super slow.
Code 1:
tinsert(a, times)
Code 2:
a[times] = times
Code 3:
a[#a + 1] = times
Code 4:
a[count] = times
count = count + 1
Code 5:
a.n = a.n + 1
a[a.n] = times
Code 6:
a[0] = a[0] + 1
a[a[0]] = times
Results (1M iterations):
Assembler Results:
tinsert: 65 instructions total.
a[times]: 62 instructions total.
a[#a + 1]: 72 instructions total.
a[count]: 78 instructions total.
a[a.n]: ~52
a[a[0]]: ~51
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
tinsert and a[#a + 1]
0.09972 sec(s)
0.09614
0.16774
0.10205
(1673.15%) (16 times slower)
2
a[times]
0.00596 sec(s)
0.00507
0.01528
0.00629
(100%)
3
a[count]
0.00655 sec(s)
0.00599
0.00806
0.00657
(109.89%)
4
a[a.n]
0.00689 sec(s)
0.006
0.00865
0.00696
(115.6%)
5
a[a[0]]
0.00833 sec(s)
0.00751
0.01167
0.00844
(139.76%)
Conclusion:
Using a local or a constant value is the fastest method.
If that's not possible, use an external counter; otherwise use a.n = a.n + 1; a[a.n] = times or a[#a + 1].
The instruction counts may be inaccurate due to my limited knowledge of assembler.
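The external-counter pattern from the conclusion, spelled out (hypothetical five-element fill):

```lua
local a, count = {}, 0
for times = 1, 5 do
  count = count + 1  -- external counter instead of #a or table.insert
  a[count] = times * 10
end
print(a[1], a[5])  -- 10  50
```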
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
tinsert
0.1522 sec(s)
0.14448
0.21487
0.15571
(112.44%)
2
a[times]
0.01899 sec(s)
0.01791
0.03054
0.0194
(14.03%) (7 times faster)
3
a[#a + 1]
0.13535 sec(s)
0.12965
0.17014
0.13644
(100%)
4
a[count]
0.0277 sec(s)
0.02617
0.03003
0.02779
(20.46%) (4 times faster)
5
a[a.n]
0.0368 sec(s)
0.03462
0.057
0.03752
(27.18%) (3 times faster)
6
a[a[0]]
0.0335 sec(s)
0.03114
0.04102
0.03386
(24.75%) (4 times faster)
Conclusion:
Please note that the percentages in this table are computed against a different baseline row than in the compiled results.
Using a local or a constant value is the fastest method. If that's not possible, use an external counter; otherwise use a.n = a.n + 1; a[a.n] = times or a[#a + 1].
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
tinsert
0.134 sec(s)
0.128
0.165
0.13653
(103.07%)
2
a[times]
0.06 sec(s)
0.057
0.066
0.06042
(46.15%) (2 times faster)
3
a[#a + 1]
0.13 sec(s)
0.125
0.162
0.13142
(100%)
4
a[count]
0.075 sec(s)
0.069
0.108
0.07713
(57.69%)
5
a[a.n]
0.188 sec(s)
0.179
0.245
0.19067
(144.61%)
6
a[a[0]]
0.255 sec(s)
0.246
0.292
0.25796
(196.15%)
Conclusion:
Please note that the percentages in this table are computed against a different baseline row than in the compiled results.
Using a local or a constant value is the fastest method. If that's not possible, use an external counter; otherwise use a.n = a.n + 1; a[a.n] = times or a[#a + 1].
13. Table with and without pre-allocated size
Predefines:
local a
require("table.new")
local new = table.new
local ffi = require("ffi")
local ffinew = ffi.new
Code 1:
local a = {}
a[1] = 1
a[2] = 2
a[3] = 3
Code 2:
local a = {true, true, true}
a[1] = 1
a[2] = 2
a[3] = 3
Code 3 (table.new is available since LuaJIT v2.1.0-beta1):
local a = new(3,0)
a[1] = 1
a[2] = 2
a[3] = 3
Code 4:
local a = {1, 2, 3}
Code 5 (FFI):
local a = ffinew("int[3]", 1, 2, 3)
Code 6 (FFI):
local a = ffinew("int[3]")
a[0] = 1
a[1] = 2
a[2] = 3
(FFI) Defined in constructor: 18 instructions total.
(FFI) Defined after: 18 instructions total.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Allocated on demand
1.26337 sec(s)
1.24794
1.53786
1.2751
(39480.31%) (394 times slower)
2
Pre-allocated with dummy values
0.0032 sec(s)
0.00312
0.00358
0.00322
(100%)
3
Pre-allocated by table.new
0.41859 sec(s)
0.4055
0.49486
0.42476
(13080.93%) (130 times slower)
4
Defined in constructor
0.00325 sec(s)
0.00306
0.00411
0.00329
(101.56%)
5
(FFI) Defined in constructor
0.00325 sec(s)
0.0031
0.00425
0.00331
(101.56%)
6
(FFI) Defined after
0.00339 sec(s)
0.00312
0.00463
0.00351
(105.93%)
Conclusion:
Pre-allocation will speed up your code if you need more speed.
In 50% of cases tables are used without pre-allocated space, so it's OK to allocate them on demand.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Allocated on demand
1.73737 sec(s)
1.7137
1.90643
1.74489
(310.27%) (3 times slower)
2
Pre-allocated with dummy values
0.61846 sec(s)
0.61472
0.6396
0.61924
(110.44%)
3
Pre-allocated by table.new
0.86155 sec(s)
0.81076
1.41348
0.86788
(153.86%)
4
Defined in constructor
0.55995 sec(s)
0.53821
0.63602
0.56426
(100%)
5
(FFI) Defined in constructor
3.09061 sec(s)
2.94983
3.91517
3.18377
(551.94%) (5 times slower)
6
(FFI) Defined after
4.46811 sec(s)
4.18024
5.32326
4.61457
(797.94%) (7 times slower)
Conclusion:
Pre-allocation will speed up your code if you need more speed.
In 50% of cases tables are used without pre-allocated space, so it's OK to allocate them on demand.
If you don't need an FFI array, don't use one for CPU optimization (only for saving RAM).
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Allocated on demand
5.304 sec(s)
5.243
5.694
5.32726
(196.88%)
2
Pre-allocated with dummy values
2.863 sec(s)
2.676
3.763
2.9231
(106.27%)
3
Defined in constructor
2.694 sec(s)
2.303
3.364
2.65954
(100%)
Conclusion:
Pre-allocation will speed up your code if you need more speed.
In 50% of cases tables are used without pre-allocated space, so it's OK to allocate them on demand.
14. Table initialization before or each time on insertion
Predefines:
local T = {}
local CachedTable = {"abc", "def", "ghk"}
local text = "Hello, this is an example text"
local cstring = ffi.cast("const char*", text)
local char = string.char
local sub, gsub, gmatch = string.sub, string.gsub, string.gmatch
local gsubfunc = function(s)
local x = s
end
Code 1:
for i = 1, #text do
local x = sub(text, i, i)
end
Code 2:
for k in gmatch(text, ".") do
local x = k
end
Code 3:
gsub(text, ".", gsubfunc)
Code 4 (FFI):
for i = 0, #text - 1 do
local x = char(cstring[i])
end
If you're using FFI on LuaJIT 2.1.0 or higher, splitting will be the fastest.
You probably won't need to split at all: FFI arrays are mutable, so all text manipulations can be done in place. Otherwise use string.sub.
It's recommended to use string.find, string.match, etc. where possible; splitting into single characters wastes GC time.
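A sketch of the string.find/string.match advice (the "%a+" pattern and the word-collecting loop are mine):

```lua
-- Pull whole tokens with gmatch instead of splitting into single characters.
local text = "Hello, this is an example text"
local words = {}
for word in text:gmatch("%a+") do
  words[#words + 1] = word
end
print(#words)  -- 6
```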
Please note that the percentages in this table are computed against a different baseline row than in the other results.
This is an example of LuaJIT failing to optimize and compile code efficiently: the loop wasn't unrolled properly.
LuaJIT's advice is to find a balance between loops and unrolls and to use templates.
table.concat is the best solution in complicated code; however, where possible, inline your concats or unroll the loops.
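A minimal table.concat sketch (sample strings are mine): the pieces are collected in a table and joined once, instead of allocating an intermediate string on every `..`.

```lua
local parts = {}
for i = 1, 4 do
  parts[i] = "item" .. i  -- collect pieces, no intermediate concatenations
end
local s = table.concat(parts, ", ")
print(s)  -- item1, item2, item3, item4
```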
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Inline concat
1.44256 sec(s)
1.42674
1.76183
1.46447
(100%)
2
Separate concat
5.82289 sec(s)
5.44671
7.76331
5.9645
(403.64%) (4 times slower)
3
Loop concat
6.61971 sec(s)
5.70944
7.64707
6.6218
(458.88%) (4 times slower)
4
table.concat
1.49022 sec(s)
1.41849
1.95012
1.56112
(103.30%)
5
string.format
1.46481 sec(s)
1.42773
2.05097
1.52796
(101.54%)
Conclusion:
If possible, inline your concats; otherwise use table.concat.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
Inline concat
1.023 sec(s)
1.01
1.296
1.04552
(100%)
2
Separate concat
10.445 sec(s)
9.918
12.909
10.63149
(1021.01%) (10 times slower)
3
Loop concat
11.723 sec(s)
9.919
14.472
11.64345
(1145.94%) (11 times slower)
4
table.concat
2.151 sec(s)
2.083
2.378
2.16366
(210.26%) (2 times slower)
5
string.format
2.179 sec(s)
2.116
3.099
2.26572
(213%) (2 times slower)
Conclusion:
If possible, inline your concats; otherwise use table.concat.
local TYPE_bool = "boolean"
local type = type
local function isbool1(b)
return type(b) == "boolean"
end
local function isbool2(b)
return type(b) == TYPE_bool
end
Code 1:
isbool1(false)
Code 2:
isbool2(false)
Results (10M iterations):
Assembler Results:
KGC string: 18 instructions total.
Upvalued string: 18 instructions total.
Conclusion:
LuaJIT compiles them with the same performance.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
KGC string
0.39173 sec(s)
0.37698
0.63159
0.41579
(100%)
2
Upvalued string
0.40781 sec(s)
0.3934
0.51813
0.4151
(104.10%)
Conclusion:
If possible, use literal strings inside the function.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
KGC string
1.324 sec(s)
1.26
1.99
1.37005
(100%)
2
Upvalued string
1.3915 sec(s)
1.268
1.773
1.40522
(105.09%)
Conclusion:
If possible, use literal strings inside the function.
20. Taking a value from a function with multiple returns🔗︎
Predefines:
local function funcmret()
return 1, 2
end
local select = select
Code 1:
local _, arg2 = funcmret()
return arg2
Code 2:
local arg2 = select(2, funcmret())
return arg2
Results (10M iterations):
Assembler Results:
With dummy variables: 18 instructions total.
select: 18 instructions total.
Conclusion:
LuaJIT compiles them with the same performance.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
With dummy variables
0.25193 sec(s)
0.24568
0.27575
0.25267
(100%)
2
select
0.38455 sec(s)
0.37498
0.4397
0.38579
(152.63%)
Conclusion:
select makes little sense for functions returning fewer than roughly 10 values: all returned values are pushed onto the stack anyway, and any value you choose can be picked up individually.
Tip: if you need only the first return value, wrap the function call in parentheses.
print( (math.frexp(0)) )
This will print only the first value.
Benchmark Results:
#
Name
Median
Minimum
Maximum
Average
Percentage
1
With dummy variables
0.611 sec(s)
0.6
0.702
0.61562
(100%)
2
select
0.813 sec(s)
0.786
0.926
0.81984
(133.06%)
Conclusion:
select makes little sense for functions returning fewer than roughly 10 values: all returned values are pushed onto the stack anyway, and any value you choose can be picked up individually.
Tip: if you need only the first return value, wrap the function call in parentheses.