Benchmarks for basic stuff. These tests were run on a rpi 3 (Cortex A53). The code is available here, with the accepted results being those where the average time was within the minimum time and its %105 (low variance only achievable on low loadavg). ######## TABLE ITERATION ######## There are several methods that allow iteration of tables in lua, specially on "array" tables where the keys are the integers {1, 2, ..., n}. Let's compare this kind of tables first adding an array of random numbers, via the following alternatives: -> range: for i = 1, #t do local v=t[i]; ... end -> ipairs: for i, v in ipairs(t) do ... end -> pairs: for k, v in ipairs(t) do ... end range results: #t | lua 5.1 | lua 5.2 | lua 5.3 | luajit 10 | 136ns·#t | 157ns·#t | 127ns·#t | 11.6ns·#t 100 | 118ns·#t | 144ns·#t | 109ns·#t | 7.91ns·#t 1000 | 117ns·#t | 161ns·#t | 107ns·#t | 7.54ns·#t ipairs results: #t | lua 5.1 | lua 5.2 | lua 5.3 | luajit 10 | 348ns·#t | 348ns·#t | 331ns·#t | 17.2ns·#t 100 | 292ns·#t | 291ns·#t | 276ns·#t | 9.22ns·#t 1000 | 289ns·#t | 301ns·#t | 270ns·#t | 8.42ns·#t pairs results: #t | lua 5.1 | lua 5.2 | lua 5.3 | luajit 10 | 328ns·#t | 345ns·#t | 341ns·#t | 72.1ns·#t 100 | 270ns·#t | 291ns·#t | 280ns·#t | 53.7ns·#t 1000 | 263ns·#t | 279ns·#t | 277ns·#t | 50.4ns·#t Comments: Who have thought ipairs would be a little slower than pairs on lua5.1, if anything one would have expected the opposite. If we now compare the results to the tight loop of the equivalent native code, we can see that it should take about 4 cycles (3.3ns·#t on a rpi3). Note that r2 is the vector's upper limit, r3 the array pointer, d6 the current number to add and d7 the accumulator. The instruction latency, pipelines used and throughput is also included. address, bytes, instruction, args, | lat (pip) thr 1060c: ecb36b02 vldmia r3!, {d6} | 4 (L) 1 loading 10610: e1530002 cmp r3, r2 | 1 (I) 1 comparing 10614: ee377b06 vadd.f64 d7, d7, d6 | 5 (F) 2 add 10618: 1afffffb bne 1060c <fn+0x24> | 1 (B) 1 repeat until zero Explaining why we got 4 cycles = 3.3ns per addition, even though this code could be optimized a little bit :) Let's see how luajit does in its best run: The registers used are: r0, iteration limit (1000 in this case); r4, variable "i"'s current value; r6, array address; r11, address of current element to add; d14, current element to add; d15, accumulator. The remaining register, lr, is used for holding the upper 32-bits of the current 64-bit floating point "number" to add, by adding 15=0xf and checking the carry flag the code can detect the case when the all bits 31~4 are 1 and at least one of the bits 0~3 is true, branching to 0x540020 on that case. You may wonder, what's so special about those bits? Well, as luajit's lj_obj.h explains IEEE-754 64-bit floating point numbers (standard numbers) detect as a special number NaN (Not a Number really) the values between: 0x7ff0_0000_0000_0001 = inf + 1, and 0x7fff_ffff_ffff_ffff, and for -NaN: 0xfff0_0000_0000_0001 = -inf - 1, and 0xffff_ffff_ffff_ffff. However, of those 2*(2**52 - 1) possible NaNs only a few are used, having those with the most significant word > 0xfff8_0000 available for other uses! Luajit uses those bits to store the type of any other (non-number) object, for nil the value ~0 = -1 = 0xffff_ffff is used, for false ~1 = -2 = 0xffff_fffe, ~4 for strings, etc... on these cases the least significant word is used as well (e.g. for storing the address of the strings), so that now storing a number consumes the same amount of memory as storing a reference to an object. 547ea8: add r11, r6, r4, lsl #3 | address r11 <- r6 + r4*8 547eac: ldr lr, [r11, #4] | load the into lr 547eb0: vldr d14, [r11] | load the full 64-bit number into d14 547eb4: cmn lr, #15 | ensure the number was really a number 547eb8: blcs 0x540020 | handling the special case otherwise 547ebc: vadd.f64 d15, d14, d15 | add d14 into the accumulator 547ec0: add r4, r4, #1 | advance the index to the next position 547ec4: cmp r4, r0 | compare to the limit 547ec8: ble 0x547ea8 | and repeat while it is not surpassed This is how luajit achieves near optimum performance while having to deal with heterogeneous arrays. But on non contiguous arrays aka hash tables, the result is very different, the next test deals with the iteration of a table where the keys are now the odd numbers {1, 3, ..., 2n-1}. pairs_odd results: #t | lua 5.1 | lua 5.2 | lua 5.3 | luajit 10 | 494ns·#t | 510ns·#t | 438ns·#t | 85ns·#t 100 | 432ns·#t | 454ns·#t | 388ns·#t | 63ns·#t 1000 | 429ns·#t | 444ns·#t | 389ns·#t | 57ns·#t range_odd results: #t | lua 5.1 | lua 5.2 | lua 5.3 | luajit 10 | 239ns·#t | 285ns·#t | 148ns·#t | 11.9ns·#t 100 | 230ns·#t | 278ns·#t | 131ns·#t | 7.92ns·#t 1000 | 231ns·#t | 277ns·#t | 140ns·#t | 7.54ns·#t Looking at the machine code generated by luajit this time, the only difference occurs in the add r4, r4, #1 line, that now adds #2, that's why the performance was so similar; even though the array is not contiguous, luajit stores it in a plain array interleaved with nils. ######## TABLE INSERTION ######## -> insert: table.insert(t, item) -> insert2: local insert = table.insert; insert(t, item) -> len: t[#t + 1] = item -> len2: t[tnext], tnext = item, tnext + 1 insert results: #t | lua 5.1 | lua 5.2 | lua 5.3 | luajit 10 | 541ns·#t | 658ns·#t | 645ns·#t | 130ns·#t 100 | 542ns·#t | 656ns·#t | 645ns·#t | 130ns·#t 1000 | 537ns·#t | 654ns·#t | 641ns·#t | 127ns·#t insert2 results: #t | lua 5.1 | lua 5.2 | lua 5.3 | luajit 10 | 418ns·#t | 513ns·#t | 519ns·#t | 127ns·#t 100 | 425ns·#t | 516ns·#t | 520ns·#t | 129ns·#t 1000 | 424ns·#t | 518ns·#t | 520ns·#t | 129ns·#t len results: #t | lua 5.1 | lua 5.2 | lua 5.3 | luajit 10 | 317ns·#t | 360ns·#t | 328ns·#t | 127ns·#t 100 | 318ns·#t | 366ns·#t | 331ns·#t | 130ns·#t 1000 | 320ns·#t | 367ns·#t | 332ns·#t | 130ns·#t len2 results: #t | lua 5.1 | lua 5.2 | lua 5.3 | luajit 10 | 141ns·#t | 176ns·#t | 140ns·#t | 24ns·#t 100 | 144ns·#t | 180ns·#t | 141ns·#t | 27ns·#t 1000 | 144ns·#t | 180ns·#t | 141ns·#t | 28ns·#t ######## STRING ITERATION ######## -> gsub: for c in s:gmatch '.' do ... end -> sub: for i = 1, #s do local c = s:sub(i, i); ... end -> byte: for i = 1, #s do local b = s:byte(i, i); ... end