wiki: lua/micro-benchmarks

Benchmarks for basic stuff. These tests were run on a rpi 3 (Cortex A53).
The code is available here, with the accepted results being those where the
average time was within the minimum time and its %105 (low variance only
achievable on low loadavg).

######## TABLE ITERATION ########

There are several methods that allow iteration of tables in lua, specially on
"array" tables where the keys are the integers {1, 2, ..., n}.  Let's compare
this kind of tables first adding an array of random numbers, via the following
alternatives:

-> range: for i = 1, #t do local v=t[i]; ... end
-> ipairs: for i, v in ipairs(t) do ... end
-> pairs: for k, v in ipairs(t) do ... end

range results:
   #t |  lua 5.1 |  lua 5.2 |  lua 5.3 |  luajit
   10 | 136ns·#t | 157ns·#t | 127ns·#t | 11.6ns·#t
  100 | 118ns·#t | 144ns·#t | 109ns·#t | 7.91ns·#t
 1000 | 117ns·#t | 161ns·#t | 107ns·#t | 7.54ns·#t

ipairs results:
   #t |  lua 5.1 |  lua 5.2 |  lua 5.3 |  luajit
   10 | 348ns·#t | 348ns·#t | 331ns·#t | 17.2ns·#t
  100 | 292ns·#t | 291ns·#t | 276ns·#t | 9.22ns·#t
 1000 | 289ns·#t | 301ns·#t | 270ns·#t | 8.42ns·#t

pairs results:
   #t |  lua 5.1 |  lua 5.2 |  lua 5.3 |  luajit
   10 | 328ns·#t | 345ns·#t | 341ns·#t | 72.1ns·#t
  100 | 270ns·#t | 291ns·#t | 280ns·#t | 53.7ns·#t
 1000 | 263ns·#t | 279ns·#t | 277ns·#t | 50.4ns·#t

Comments: Who have thought ipairs would be a little slower than pairs on lua5.1,
if anything one would have expected the opposite. 

If we now compare the results to the tight loop of the equivalent native code,
we can see that it should take about 4 cycles (3.3ns·#t on a rpi3).
Note that r2 is the vector's upper limit, r3 the array pointer, d6 the
current number to add and d7 the accumulator. The instruction latency, pipelines
used and throughput is also included.

 address,  bytes,  instruction, args,      | lat (pip) thr
  1060c: ecb36b02  vldmia  r3!, {d6}       |   4 (L) 1  loading
  10610: e1530002  cmp     r3, r2          |   1 (I) 1  comparing
  10614: ee377b06  vadd.f64     d7, d7, d6 |   5 (F) 2  add
  10618: 1afffffb  bne     1060c <fn+0x24> |   1 (B) 1  repeat until zero

Explaining why we got 4 cycles = 3.3ns per addition, even though this code could
be optimized a little bit :)  Let's see how luajit does in its best run:

The registers used are: r0, iteration limit (1000 in this case); r4, variable
"i"'s current value; r6, array address; r11, address of current element to add;
d14, current element to add; d15, accumulator.

The remaining register, lr, is used for holding the upper 32-bits of the current
64-bit floating point "number" to add, by adding 15=0xf and checking the carry
flag the code can detect the case when the all bits 31~4 are 1 and at least one
of the bits 0~3 is true, branching to 0x540020 on that case.

You may wonder, what's so special about those bits?  Well, as luajit's lj_obj.h
explains IEEE-754 64-bit floating point numbers (standard numbers) detect as a
special number NaN (Not a Number really) the values between:
    0x7ff0_0000_0000_0001 = inf + 1, and
    0x7fff_ffff_ffff_ffff, and for -NaN:
    0xfff0_0000_0000_0001 = -inf - 1, and
    0xffff_ffff_ffff_ffff.
However, of those 2*(2**52 - 1) possible NaNs only a few are used, having those
with the most significant word > 0xfff8_0000 available for other uses! Luajit
uses those bits to store the type of any other (non-number) object, for nil the
value ~0 = -1 = 0xffff_ffff is used, for false ~1 = -2 = 0xffff_fffe, ~4 for
strings, etc... on these cases the least significant word is used as well (e.g.
for storing the address of the strings), so that now storing a number consumes
the same amount of memory as storing a reference to an object.

  547ea8:  add    r11, r6, r4, lsl #3 | address r11 <- r6 + r4*8
  547eac:  ldr    lr, [r11, #4]       | load the into lr
  547eb0:  vldr   d14, [r11]          | load the full 64-bit number into d14
  547eb4:  cmn    lr, #15             | ensure the number was really a number
  547eb8:  blcs   0x540020            | handling the special case otherwise
  547ebc:  vadd.f64   d15, d14, d15   | add d14 into the accumulator
  547ec0:  add    r4, r4, #1          | advance the index to the next position
  547ec4:  cmp    r4, r0              | compare to the limit
  547ec8:  ble    0x547ea8            | and repeat while it is not surpassed

This is how luajit achieves near optimum performance while having to deal with
heterogeneous arrays.

But on non contiguous arrays aka hash tables, the result is very different, the
next test deals with the iteration of a table where the keys are now the odd
numbers {1, 3, ..., 2n-1}.

pairs_odd results:
   #t |  lua 5.1 |  lua 5.2 |  lua 5.3 |  luajit
   10 | 494ns·#t | 510ns·#t | 438ns·#t | 85ns·#t
  100 | 432ns·#t | 454ns·#t | 388ns·#t | 63ns·#t
 1000 | 429ns·#t | 444ns·#t | 389ns·#t | 57ns·#t

range_odd results:
   #t |  lua 5.1 |  lua 5.2 |  lua 5.3 |  luajit
   10 | 239ns·#t | 285ns·#t | 148ns·#t | 11.9ns·#t
  100 | 230ns·#t | 278ns·#t | 131ns·#t | 7.92ns·#t
 1000 | 231ns·#t | 277ns·#t | 140ns·#t | 7.54ns·#t

Looking at the machine code generated by luajit this time, the only difference
occurs in the add r4, r4, #1 line, that now adds #2, that's why the performance
was so similar; even though the array is not contiguous, luajit stores it in a
plain array interleaved with nils.

######## TABLE INSERTION ########

-> insert: table.insert(t, item)
-> insert2: local insert = table.insert; insert(t, item)
-> len: t[#t + 1] = item
-> len2: t[tnext], tnext = item, tnext + 1

insert results:
   #t |  lua 5.1 |  lua 5.2 |  lua 5.3 |  luajit
   10 | 541ns·#t | 658ns·#t | 645ns·#t | 130ns·#t
  100 | 542ns·#t | 656ns·#t | 645ns·#t | 130ns·#t
 1000 | 537ns·#t | 654ns·#t | 641ns·#t | 127ns·#t

insert2 results:
   #t |  lua 5.1 |  lua 5.2 |  lua 5.3 |  luajit
   10 | 418ns·#t | 513ns·#t | 519ns·#t | 127ns·#t
  100 | 425ns·#t | 516ns·#t | 520ns·#t | 129ns·#t
 1000 | 424ns·#t | 518ns·#t | 520ns·#t | 129ns·#t

len results:
   #t |  lua 5.1 |  lua 5.2 |  lua 5.3 |  luajit
   10 | 317ns·#t | 360ns·#t | 328ns·#t | 127ns·#t
  100 | 318ns·#t | 366ns·#t | 331ns·#t | 130ns·#t
 1000 | 320ns·#t | 367ns·#t | 332ns·#t | 130ns·#t

len2 results:
   #t |  lua 5.1 |  lua 5.2 |  lua 5.3 |  luajit
   10 | 141ns·#t | 176ns·#t | 140ns·#t | 24ns·#t
  100 | 144ns·#t | 180ns·#t | 141ns·#t | 27ns·#t
 1000 | 144ns·#t | 180ns·#t | 141ns·#t | 28ns·#t

######## STRING ITERATION ########

-> gsub: for c in s:gmatch '.' do ... end
-> sub: for i = 1, #s do local c = s:sub(i, i); ... end
-> byte: for i = 1, #s do local b = s:byte(i, i); ... end


                                                                              π