wiki: blog/drawing-with-iptables-harder-drive-followup



To understand what I'm talking about here, I *really* suggest watching
this video titled: "Harder Drive: Hard drives we didn't want or need":

  -> https://www.youtube.com/watch?v=JcJSW7Rprio

which at the 7:46 it says: "and it wouldn't be that hard to embed some
kind of picture or message in here, like how cool would it be to embed
a little qr code in this... right? it'd be a little bit cool... maybe
a little embarrassing too.  Anyway, that's the Internet."

This means that for this blog post we are going to paint the following
picture of me, taken in the Korankei gorge, located in Asuke-chō, Aichi,
Japan.  Yes, I know, the background is not very recognizable.

            

So, the trivial solution would be to convert each black or white pixel
into a rule in our router's firewall, for example to draw a black pixel:
  iptables -A $some_chain -d $pixel_ip -j DROP

However this would not only require at least 7839 rules for this
small image making our firewall huge and slow, but would completely lack
any fun, and so in the spirit of a Harder Drive it makes sense to find an
overly complicated way to achieve the same result.

Btw, to convert the IP into the coordinates of the image, we can use the
transformation already mentioned on the video, called Hilbert curve:
  https://en.wikipedia.org/wiki/Hilbert_curve
and so I'm going to steal the implementation written there.



Did you know iptables supports a module called "bpf"?

This bpf module allows us to associate a small program (a small sequence of
byte-codes) to a rule, that will be run for each packet to determine if it
matches or not.  Even though its computation model is not Turing complete,
it give us a great lot of flexibility.

To write bpf programs we have basically two alternatives, one of them is
using the pcap library directly to write the compile it from a pcap
expression, either by using tcpdump on a raw interfaces such as tun*, or
with a script like this one (changing «code» in the first line to specify
the program you would like to compile):

$ luajit -e 'code = "dst 10.71.1.1"
local ffi = require "ffi"
ffi.cdef [[
  struct bpf_insn { uint16_t code; uint8_t jt; uint8_t jf; uint32_t k; };
  struct bpf_program { unsigned int bf_len; struct bpf_insn *bf_insns; };
  int pcap_compile_nopcap(int, int, struct bpf_program *,
    const char *, int, uint32_t);]]
prog = ffi.new "struct bpf_program[1]"
assert(ffi.load "pcap".pcap_compile_nopcap(-1, 12, prog, code, 1, -1) == 0)
local t = {prog[0].bf_len}
for i = 0, prog[0].bf_len-1 do
  local ins = prog[0].bf_insns[i]      
  t[#t+1] = ("%d %d %d %d"):format(ins.code, ins.jt, ins.jf, ins.k)    
end                        
print(table.concat(t,","))'
7,48 0 0 0,84 0 0 240,21 0 3 64,32 0 0 16,21 0 1 172425473,6 0 0 4294967295,6 0 0 0

Here is the documentation for pcap/tcpdump expressions:

  -> https://www.tcpdump.org/manpages/pcap-filter.7.html

And so to disassemble bpf byte-code you can use this program:

  -> https://github.com/cloudflare/bpftools/blob/master/linux_tools/bpf_dbg.c

Let's then disassemble the program we just compiled:

$ ./bpf_dbg
> load bpf 7,48 0 0 0,84 0 0 240,21 0 3 64,32 0 0 16,21 0 1 172425473,6 0 0 4294967295,6 0 0 0
> disassemble
l0:     ldb [0]
l1:     and #0xf0
l2:     jeq #0x40, l3, l6
l3:     ld [16]
l4:     jeq #0xa470101, l5, l6
l5:     ret #0xffffffff
l6:     ret #0

It makes sense right?
  * l0 loads the first byte of the IP packet into the register A,
       this byte contains the IP version in the upper nibble.
  * l1 discards the lower nibble: it's doing a bit-wise and: A = A & 0xf0
  * l2 compares the register A with the immediate value 0x40 jumping
       to l3 if it is equal, or l6 otherwise.
  * l3 loads the word (32 bits) at offset 16 which in an IP packet it is
       the destination address, into the register A.
  * l4 compares the register A with the immediate value 0x0A470101 which
       is the IP 10.71.1.1 written in hex, if it matches it jumps to l5,
       if it doesn't, it matches to l6.
  * l5 returns -1, which by being different to 0 it means "match packet"
  * l6 returns 0, meaning "don't match packet".

In other words, as expected the program "dst 10.71.1.1" matches a given
packet only if it has IP version 4, and if its destination IP address is
10.71.1.1.

The other alternative, which we will be using today, is to write bpf
programs in assembly and compile them using an assembler called "bpfc",
which is included in the netsniff-ng debian package.

In order to do this we need some knowledge about how the BPF model works.

* We have two registers, A and X.  A the accumulator can be used for
  computations, and the X register can be used for indirect addressing.
* There is a scratch memory bank composed of 16 32bit registers M[0] to
  M[15], which doesn't support indexed addressing and can only be used
  on the instructions: ld, ldx, st, stx for transferring to and fro the
  registers A and X.
* Arithmetic and logic operations will always use A as the left operand
  and destination register, while the right operand can either be a 32-bit
  immediate value or the register X:
    a := a «op» #imm
    a := a «op» x
  The following operations are provided: arithmetic: add, sub, mul, div, mod;
  logic: or, and, lsh, rsh, xor; unary: neg (basically A := -A).
* Only jumps forward are allowed, the destination address is always a
  relative immediate unsigned offset: This is important because ensures all
  programs will terminate in at most as many steps as its size; no, there's
  no call stack nor any kind of branch and link BL instruction.  There are no
  functions/subroutines at all.
* Programs can finish by returning a constant (ret #imm) or the value of the
  register A (ret a).  A return value of zero will not match the packet.
* Furthermore the maximum size allowed for BPF programs can be quite small as
  we will see in the following section.

If you want more detail about the classic BPF, here is the paper with a good
introduction to BPF (it's a nice reading material), and the reference manual:

  -> https://www.tcpdump.org/papers/bpf-usenix93.pdf
  -> https://www.freebsd.org/cgi/man.cgi?bpf



So... how much space do we have?: 64, yes only 64 instructions... 😞

  -> https://git.netfilter.org/iptables/tree/include/linux/netfilter/xt_bpf.h?id=18c96821b5901ac5c66dcbc5f299bd07ef5569ef#n8

Let's see, we can at least store 32 bits in a single immediate value, which
we can use by indexing it with the lowest 5 bits of the IP address, in C it
would be: (SOME_CONSTANT >> (address & 0x1f)) & 1.  This in BPF can be
programmed as:
l0: ld [16]     // a := destination ip address
l1: and #0x1f   // a := a & 0x1f, leaving a number between 0 and 31
l2: tax         // x := a, copy into x so that we can shift right by it
l3: ld #BITSET  // a := BITSET, basically the value for 32 pixels
l4: rsh x       // a := a >> x, discards the lowest x bits
l5: and #1      // a := a & 1, leaving only the selected bit.
l6: ret a       // finish the program with either 1 or 0

To see how a right shift «rsh» can be used for indexing a bit, we can see these
two examples: if the IP address in binary ends with 00000, at l2: x will be set
to 0, and then the shift at l4 will not modify the BITSET value at register a,
meaning that l5 will select the lowest bit.

If the IP address on the other hand ends with 11111, at l2: x=31 (11111 in
binary is 31 in decimal), then the shift at l4 will shift the register A 31
bits to the right, and after the «and» at l5, only 0 or 1 will remain, it
being the content of the highest bit of BITSET.

So, how can we use more bits of the IP address to select what constant to use?
In any normal programming language we would normally just provide a table of
constants, like:

const uint32_t MY_TABLE[] = {
  0x7fcbece9, 0xfffff7ff, 0xf7fffdff, 0xff7d777b,
  0xfed7cfff, 0xfdfffbef, 0x639ff99f, 0x600d10f2,
  0x9ffc0021, 0x8800bffc, 0x23858602, 0xffffff47,
  0xefffffff, 0xffffffde, 0xffffffee, 0xebb3dffd };

and that would be the end of the problem, however there's no way to do this
in BPF, yes, we have an array of 16 memory locations, but there's no indexed
loading, and we cannot pre-initialize it.

What we can do is to have a tree of branches!  There's a very useful
instruction called «jset» that does (a & «imm») jumping to a given location
if the value is non-zero and to another given location if it is zero, this
means that we can build a binary tree to jump to a specific «ld #BITSET», so
for 2 bits it can be:

        jset #0x00000040, L1,  L0
L0:     jset #0x00000080, L01, L00
L1:     jset #0x00000080, L11, L10
L00:    ld #BITSET00
        jmp l4
L01:    ld #BITSET01
        jmp l4
L10:    ld #BITSET10
        jmp l4
L11:    ld #BITSET11

And for N additional bits, this code would take 3*(2^N)-1 instructions, meaning
that for N=5 we would have 95 instructions, going over our total limit of 64.
For N=4 instead we have 47 instructions, plus the 6 we had before, we are well
within the margin.  In the end we've managed to check against 9 bits (4 for the
binary tree + 5 by right shifting a 32 bits constant with the lowest 5 bits of
the address), and we still have some additional space to check the remaining
32-9=23 high bits of the IP address against a constant, by starting the
program with, with in the end a grand total of 56 instructions:
        ld [16]
        and #NETMASK
        jeq #NETWORK, ok
        ret #0

To encode a 128*128 pixels image 2^14 pixels, we would then need 2^(14-9)=32
lines of iptables, something completely doable.



I just include the generator for sake of completeness, it's certainly not a
polished piece of code, but it does the job:

#!/usr/bin/env luajit
local bit = require 'bit'

-- I run this program as follows:
--   ./bpfimg.lua 127.42.0.0/18 yo.pbm | sudo bash
-- and to check the generated assembly:
--   ./bpfimg.lua -d 127.42.0.0/18 yo.pbm
local function parse_args(arg)
  local i = 1
  local debug = false
  if arg[i] == '-d' then
    debug = true
    i = i + 1
  end
  -- We parse the first argument as the network/mask we are matching:
  local addr_str, cidr_str = arg[i]:match '^(%d+%.%d+%.%d+%.%d+)/(%d+)$'
  local network = 0
  assert(addr_str, 'First arg must be the network/mask, eg: 127.42.0.0/18')
  addr_str:gsub('%d+', function(octet)
    local n = tonumber(octet)
    assert(0 <= n and n <= 255)
    network = network * 256 + n
  end)
  local cidr = tonumber(cidr_str)
  assert(18 <= cidr and cidr <= 32)
  assert(cidr % 2 == 0, 'Images must have natural aligment')
  local side = 2^((32 - cidr)/2)
  local npixels = side * side

  local netmask = -bit.lshift(1, 32-cidr)
  assert(bit.band(network, netmask) == bit.tobit(network))

  local imgdata
  do
    local fd = assert(io.open(arg[i+1]))
    local rawdata = fd:read '*a':gsub('#.-\n', '')
    local width, height, data = rawdata:match '^P1%s+(%d+)%s+(%d+)%s+(.*)'
    imgdata = data:gsub('%s', '')
    assert(imgdata, 'Image must be in XBM format')
    assert(tonumber(width) == side and tonumber(height) == side,
      'Expected square image of side ' .. side)
    assert(#imgdata == width * height)
    fd:close()
  end
  return network, cidr, side, npixels, imgdata, debug
end

local network, cidr, side, npixels, imgdata, debug = parse_args(arg)

-- This code is basically the C code at Wikipedia's article on Hilbert
-- curve translated to lua, this is why it's so "bit." heavy.
local function d2xy(d)
  -- parameter d and return value are all 0-based.
  local x, y, s = 0, 0, 1
  while s < 65536 do
    local rx = bit.band(1, bit.rshift(d, 1))
    local ry = bit.band(1, bit.bxor(d, rx))
    if ry == 0 then
      if rx == 1 then
        x = s-1 - x
        y = s-1 - y
      end
      x, y = y, x
    end
    x = x + s * rx
    y = y + s * ry
    s = bit.lshift(s, 1)
    d = bit.rshift(d, 2)
  end
  return x, y
end

-- Find corner with lowest values.  This could be optimized if we were
-- to abuse the fact that d2xy uses the Hilbert curve.
local ox, oy = d2xy(network)
for d = network, network + npixels-1 do
  local x, y = d2xy(d)
  ox, oy = math.min(x, ox), math.min(y, oy)
end

local function test_d2xy()
  local visited = {}
  local x, y = d2xy(network)
  local px, py = x-1, y
  for d = 0, npixels-1 do
    x, y = d2xy(d + network)
    assert(0 <= x-ox and x-ox < side)
    assert(0 <= y-oy and y-oy < side)
    assert(not visited[y*65536+x])
    visited[y*65536+x] = true
    assert(math.abs(x-px) + math.abs(y-py) == 1)
    px, py = x, y
  end
end
test_d2xy()

local genbits_visited = {}
-- Given an IP address as an integer number, calculate the bitset for
-- the 32 IPs block that the BPF code uses for the 5 LSb of the address
local function genbits(base)
  local ret = 0
  if base < 0 then base = base + 0x100000000 end
  for i = 0, 31 do
    local ip = base + i
    local rel = ip - network
    if 0 <= rel and rel < npixels then
      local x, y = d2xy(ip)
      -- we have to obtain the position in the XBM relative to the origin
      local pos = (y - oy) * side + (x - ox) + 1
      -- 49 is ('1'):byte(), representing the black color = DROP
      if imgdata:byte(pos) == 49 then
        -- LSb corresponds to the first IP in the range, since the bpf
        -- code does: (ret >> (IP&31)) & 1
        ret = ret + bit.lshift(1, i)
      end
    end
  end
  -- just to report some statics at the end:
  local visit_count = genbits_visited[ret] or 0
  if visit_count > 0 then
    genbits_visited.duplicates = (genbits_visited.duplicates or 0) + 1
  end
  genbits_visited.total = (genbits_visited.total or 0) + 1
  genbits_visited[ret] = visit_count + 1
  return ret
end

local function toipstring(ip)
  return ('%d.%d.%d.%d'):format(
    bit.rshift(ip, 24),
    bit.band(bit.rshift(ip, 16), 255),
    bit.band(bit.rshift(ip, 8), 255),
    bit.band(ip, 255))
end

-----------------------------------------------------------------------
-- Code generation!
-----------------------------------------------------------------------

-- If you run: sudo tcpdump -ilo -d 'dst host 127.0.0.1' you will see that the
-- generated code begins by checking the ETHERTYPE field on the ethernet frame
-- In iptables however the ethernet headers are not provided
-- (man 8 iptables-extensions):
--   > Iptables passes packets from the network layer up, without mac layer.

local chain_name = ('drop_icmp_%s/%d'):format(toipstring(network), cidr)

local function generate_rule(ip, cidr)
  if cidr < 23 then
    generate_rule(ip, cidr+1)
    return generate_rule(ip + bit.lshift(1, 32-cidr-1), cidr+1)
  end
  -- we print the comment in cyan:
  print(('# %srule for %s/%d%s'):format(
    '\27[36m', toipstring(ip), cidr, '\27[0m'))
  local prog = {}
  local function out(s) prog[#prog+1] = s end
  -- We don't really have to check the version, as IPv6 goes through ip6tables
  -- instead, so let's load the destination address located at offset 16:
  local netmask = -bit.lshift(1, 32-cidr)
  out '  ld [16]'

  -- Then we check the top bits to ensure we are in the correct network

  out('  and #0x' .. bit.tohex(netmask))
  out('  jeq #0x' .. bit.tohex(ip) .. ', ok')
  out '  ret #0'

  -- let's load the dst ip again, this time to filter the last 5 bits only
  out 'ok:'
  out '  ld [16]'
  out '  and #31'
  -- and save it in register x
  out '  tax'

  -- Now let's generate the tree for filtering the bits betwen cidr and 26
  -- inclusive.  If there are no bits to test (cidr >= 27) then we just
  -- load the immediate bits for band(network, 0xffffffe0)

  if cidr >= 27 then
    prog[#prog+1] = '  ld #0x' .. bit.tohex(genbits(bit.band(ip, 0xffffffe0)))
  else
    local function dump_tree(label, ip, cidr)
      out(('%s:'):format(label))
      -- we check against 32-5=27 to interleave the constants with the tree
      -- and so theoretically support a larger jump tree if the number of
      -- maximum instructions would be larger, like 2048, as all jumps would
      -- fall within the 8-bit limit for jump offsets.
      if cidr == 27 then
        out(('  ld #0x%s'):format(bit.tohex(genbits(ip))))
        if not label:find '^L_1+$' then
          out '  jmp finish'
        end
      else
        local b = bit.lshift(1, 32-cidr-1)
        out(('  jset #0x%s, %s1, %s0'):format(bit.tohex(b), label, label))
        dump_tree(label..'0', ip, cidr+1)
        dump_tree(label..'1', ip + b, cidr+1)
      end
    end
    dump_tree('L_', ip, cidr)
  end

  out 'finish:'
  out '  rsh x'  -- a = a >> x, with x being the last 5 bits of dst addr
  out '  and #1'
  out '  ret a'
  io.write('iptables -A ' .. chain_name .. ' -m bpf --bytecode "')
  local fd = assert(io.popen('bpfc -p -f xt_bpf -i -' ..
    (debug and ' -V' or ''), 'w'))
  fd:write(table.concat(prog, '\n'))
  fd:close()
  print '" -j DROP'
end

-- this header makes sure that our chain does not filter anything but pings:
print(('iptables -N %s || iptables -F %s'):format(chain_name, chain_name))
print('iptables -A '..chain_name.." '!' -p icmp -j RETURN")
print('iptables -A '..chain_name.." -p icmp '!' --icmp-type ping -j RETURN")
generate_rule(network, cidr)

-- there was some space for a bit more optimization
print(('# total=%d, duplicated=%d'):format(
  genbits_visited.total, genbits_visited.duplicates))



If for some reason you need to have a dense set of IP addresses, then please
by all means just use ipset with type "bitmap:ip" or "hash:ip" (if the set is
sparse), or any of the additional types for other use cases.

If you need more complex stuff, you probably should use eBPF instead of BPF.

Maybe in the future I could write a second installment with a tutorial for
doing exactly this, but by writing a super efficient eBPF program instead.


                                                                              π