In this blog post we will make several attempts at implementing a simple backend function, identifying some performance and concurrency issues along the way, and hopefully reaching a decent solution by the end.

We have a table called counters on PostgreSQL with two columns: id serial primary key, count int not null. We now want to implement a function that, given an id, increments that counter by one and returns its new value.

```lua
-- pseudocode to illustrate the desired behavior.
-- return the updated value (nil if not found)
function increment_counter(id)
  local count = Counter[id]
  if count == nil then return end
  local new_value = count + 1
  Counter[id] = new_value
  return new_value
end

-- and a batch version to increment multiple counters at once, returning a
-- map where the keys are the ids, and the values the new counter values:
function batch_increment_counters(ids)
  local res = {}
  for i, id in ipairs(ids) do
    res[id] = increment_counter(id)
  end
  return res
end
```
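For reference, the examples that follow assume a table along these lines (a minimal setup sketch; the seed rows are only there so we have something to increment):

```sql
-- the counters table described above, plus some illustrative seed data
CREATE TABLE counters (
    id    serial PRIMARY KEY,
    count int NOT NULL
);
INSERT INTO counters (count) SELECT 0 FROM generate_series(1, 1000);
```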
For some weird reason there are developers that prefer using unsafe, verbose and poorly performing libraries instead of writing type-safe and simple SQL (and removing a useless layer of abstraction in the process).

```lua
-- here the code follows the same imperative pattern but now we query for
-- a counter object instead. That counter object will magically know when
-- it is modified so that "updates" get sent when we do session:flush() or
-- session:commit():
function increment_counter(session, id)
  -- this is the same as doing: SELECT * FROM counters WHERE id=$1;
  local counter = session:query(Counter):filter_by{id=id}:first()
  if counter == nil then return end
  local new_value = counter.count + 1
  counter.count = new_value
  return new_value
end

-- same code as before, except for the db session:
function batch_increment_counters(session, ids)
  local res = {}
  for i, id in ipairs(ids) do
    res[id] = increment_counter(session, id)
  end
  return res
end
```

We could argue in favor of an ORM that it lets us keep all the query code in the host language instead of dealing with it as strings (as would happen if we were writing SQL by hand). Yet this is a false dichotomy: it's perfectly possible to build the queries from an AST, and for languages with static type checking there are tools that can even verify a certain degree of soundness of those ASTs at compile time, e.g. jOOQ.

However, by using an ORM you are distancing yourself from the real code that you'll be running, and unless you're an expert on the specific ORM being used (and even if you are) it's quite common to find implicit queries running at unexpected moments. And while in most cases your code will look perfectly ok, the generated SQL can end up querying several tables multiple times with tons of useless joins! You will never notice how awful those queries are unless you test the code thoroughly while carefully reading the mountains of generated SQL... or until your app goes to production and gets hit with a burst of real usage on real data. Then you'll need to do large refactors so that the ORM magically manages to generate what you were expecting in the first place.

The main fallacy of ORMs is the belief that you won't need to learn SQL if you use one. In practice you'll end up writing ORM calls assuming they generate the SQL you had in mind, and rewriting them once you realize they don't. And unless you have deep knowledge of SQL and how it works, you may end up with a disaster waiting to happen.

But the main problem is that ORMs encourage a model where the data is fetched into program memory for processing, while what we really need for our code to be performant and safe is logic written in a declarative fashion, so that it can be correctly adapted to the RDBMS. Pure functional or declarative code also gives us some side benefits, like being able to retry queries (with application code in the middle) in case of read conflicts; Haskell's STM monad is a beautiful example of how this can be done.

Enough ORM bashing (for now); let's go back to the example and see what really happens when you run that code trying to update ids={2,7,8}:

```sql
BEGIN;
SELECT 1; -- why not? ORMs love adding useless load
SELECT counters.id "counter_id", counters.count "counter_count" FROM counters WHERE id=2;
SELECT counters.id "counter_id", counters.count "counter_count" FROM counters WHERE id=7;
SELECT counters.id "counter_id", counters.count "counter_count" FROM counters WHERE id=8;
-- then the logic of your application will run until you do
-- session:commit() and you see (assuming there was no counter with id=7):
UPDATE counters SET count=37 WHERE id=2;
UPDATE counters SET count=891 WHERE id=8;
COMMIT;
```

You then do some unit testing (even against a real db instance): it works. Plus some manual testing for good measure: it works. Ready to deploy...? Not so fast, we are just getting started!

Let's talk about "lost updates". A lost update, as its name suggests, happens when a change is totally or partially overwritten by another user that wasn't aware of it. Say we have two users:

1. User A reads a variable foo, and remembers that its value is 42.
2. User B reads the same variable foo, also remembering its value: 42.
3. Then, after having lunch, user A overwrites foo with the value it remembered plus 1, i.e. 43.
4. Finally user B, who took a bit longer, does the same: overwrites foo with the value it remembered plus one, 43.

So in the end foo gets set to 43. However, if A and B had done the same thing on different days, the value would have ended up being 44 instead.

As you may now realize, the previous code has exactly this same issue. There are two approaches to dealing with it:

* either lock, so that when user B tries to read at step 2 the request gets blocked until user A finishes,
* or use the more performant alternative that combines both operations into a single statement: `UPDATE counters SET count=count+1 WHERE id=$1 RETURNING count;`
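To see why the single-statement version avoids the lost update, picture two concurrent sessions running it against the same row (a sketch at the default READ COMMITTED isolation level; the values are illustrative):

```sql
-- Session A:
BEGIN;
UPDATE counters SET count = count + 1 WHERE id = 2 RETURNING count;  -- suppose it returns 43

-- Session B (concurrently):
BEGIN;
UPDATE counters SET count = count + 1 WHERE id = 2 RETURNING count;  -- blocks on A's row lock

-- Session A:
COMMIT;  -- B unblocks, re-reads the committed row and returns 44

-- Session B:
COMMIT;  -- both increments survive: the counter ends at 44
```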
So let's use the UPDATE alternative, and observe how we can no longer keep the traditional imperative code we had before: we are now forced to write a bunch of calls that build a query in a language that bears some resemblance to SQL, but comes with its own weird behaviors. Achievement unlocked: ORM rendered useless, code completely rewritten!

```lua
function increment_counter(session, id)
  -- this is the same as doing:
  -- UPDATE counters SET count=count+1 WHERE id=$1 RETURNING count;
  local query = Counter.update()
    :returning(Counter.c.count)
    :filter_by{id=id}
    :values{count=Counter.c.count + 1}
  local new_value = session:execute(query):first()
  return new_value
end
```

And we also need to change the batch version; it changes so much that it's easier to write it from scratch following the new increment_counter:

```lua
function batch_increment_counters(session, ids)
  -- this is the same as doing:
  -- UPDATE counters SET count=count+1 WHERE id IN (...) RETURNING id, count;
  local query = Counter.update()
    :returning(Counter.c.id, Counter.c.count)
    :where(Counter.c.id:in_(ids))
    :values{count=Counter.c.count + 1}
  local res = {}
  for _, row in ipairs(session:execute(query):fetchall()) do
    res[row.id] = row.count
  end
  return res
end

-- and now it makes sense to rewrite the single version one more time:
function increment_counter(session, id)
  return batch_increment_counters(session, {id})[id]
end
```

Now we are no longer using the ORM as an ORM; it has become a mere translation layer. Get ready to learn two different languages for the same thing, and don't be surprised if these seemingly direct translations sometimes get magically impoverished by useless joins and weird side effects of the host language. Even here the ORM will do whatever it can to make our lives as painful as possible.

Ok, ok, now that we are not using the ORM as an ORM and we changed the code to do atomic updates instead... is it finally working? Once again: unit testing -> it works. Manual testing -> it works. We even run the same code in two processes in parallel, and it works.

Does it really work? ...well... not really... let's do some testing:

```console
fran=# create table counters(id int primary key, count int not null);
CREATE TABLE
fran=# insert into counters select n, 0 from generate_series(1, 1000000) n;
INSERT 0 1000000
```

Now, in two different ttys, let's run this same one-line script doing 100 updates, each randomly updating about 1% of the table:

```console
$ for i in `seq 1 100`; do echo '
    update counters set count = count+1 where (id*random())::int%100 = 0' | psql; done
UPDATE 9827
UPDATE 9893
UPDATE 9837
UPDATE 9847
UPDATE 10022
UPDATE 10078
UPDATE 9897
UPDATE 9751
UPDATE 9729
UPDATE 9938
UPDATE 9877
UPDATE 10014
UPDATE 9993
UPDATE 9981
UPDATE 9915
UPDATE 9869
UPDATE 10040
UPDATE 9764
UPDATE 9890
UPDATE 10029
UPDATE 10022
UPDATE 9912
UPDATE 9760
UPDATE 9864
UPDATE 9653
UPDATE 9986
UPDATE 10058
UPDATE 9956
UPDATE 9847
UPDATE 9957
ERROR:  deadlock detected
DETAIL:  Process 20052 waits for ShareLock on transaction 680; blocked by process 20065.
         Process 20065 waits for ShareLock on transaction 679; blocked by process 20052.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (6328,261) in relation "counters"
UPDATE 9832
```

(wat.gif)!?

TL;DR: PostgreSQL's UPDATE locks the rows it updates, and not necessarily in order.

PostgreSQL is an MVCC (multi-version concurrency control) database: each transaction sees a consistent view of the data, and either all of its changes become visible or none of them do. This is how it's done:

* All transactions are numbered with an increasing transaction id.
* All rows carry xmin and xmax version numbers: xmin is the id of the transaction that created that row version, and xmax the id of the transaction that deleted it.
* When there's an update, the existing row data doesn't change; what really happens is that its xmax gets set, and an updated copy of the row is written to the table.
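These version numbers are ordinary (hidden) system columns, so you can peek at them directly; a quick illustrative psql check (the actual values will of course differ):

```sql
-- xmin and xmax only show up when selected explicitly
SELECT txid_current();                                    -- id of the current transaction
SELECT xmin, xmax, id, count FROM counters WHERE id = 2;
```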
When querying, PostgreSQL compares the id of the current transaction (txid_current) against each row version's xmin, skipping versions created by uncommitted or later transactions (and, symmetrically, versions whose xmax says they were already deleted). When updating, it also checks xmax (along with other fields) to see whether the row it's trying to update has already been updated by another uncommitted transaction, in which case it blocks until that transaction is either committed or rolled back.

What can then happen is:

* user A updates row1,
* user B updates row2,
* user A tries to update row2, but is blocked by B,
* user B tries to update row1, but is blocked by A,

at which point B's transaction is rolled back with a "deadlock detected" error, to avoid blocking these two transactions to infinity and beyond.

So... ooof. How about doing one update at a time, in a strict order? That way we could avoid deadlocks, since every transaction would take its row locks in the same order (note this is a lock-ordering discipline, not the two-phase commit protocol: https://en.wikipedia.org/wiki/Two-phase_commit_protocol). However, doing the updates one at a time is really slow... completely unacceptable. So...

Thankfully, there's a better alternative: SELECT FOR UPDATE. What SELECT FOR UPDATE does is lock the selected rows (by setting their xmax), which lets us choose the order in which they get locked; and we can even place the SELECT FOR UPDATE as a subquery of the UPDATE for extra performance! This is how the previous one-line script looks with it:

```console
$ for i in `seq 1 100`; do echo '
    update counters set count = count+1 where id in (
      select id from counters where (id*random())::int % 100 = 0 order by id for update
    )' | psql; done
```

Q: If this can be fixed by locking in order, why doesn't PostgreSQL do it by itself?

A: Because we were updating random rows spread all around the table. If the server had to lock them in id order, it would first have to gather and sort the matching rows before setting their xmax all over the table, which is obviously slower.

Q: How much slower is it?

A: For the previous test on my computer, total time went from 19s to 36s (about twice as much).

Q: Would it be possible to try the update without the SELECT FOR UPDATE and, only if it fails, retry with it?

A: Yup, you could do that, but note that deadlocks are only avoided if all the transactions involved respect the same order. So if you just kept retrying in random order, there could be some pathological case where you'd end up deadlocking forever. I have a hunch that by using the ordered SELECT FOR UPDATE for the retries after a deadlock, the chance of hitting it again becomes exponentially smaller with the number of retries. So in the case of a failure you would see:

```sql
BEGIN;
... -- some queries here
SAVEPOINT xyz;
UPDATE table SET ... WHERE conditions;
-- deadlock detected! (with low probability ^_^)
ROLLBACK TO SAVEPOINT xyz;
UPDATE table SET ... WHERE ids IN (
  SELECT ids FROM table WHERE conditions ORDER BY ids FOR UPDATE);
RELEASE SAVEPOINT xyz;
... -- continue with the transaction normally
```

And (usually) when it works:

```sql
BEGIN;
... -- some queries here
SAVEPOINT xyz;
UPDATE table SET ... WHERE conditions;
RELEASE SAVEPOINT xyz;
... -- continue with the transaction normally
```

Note that the update with the SELECT FOR UPDATE subquery could deadlock as well.
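As a sketch of what that retry could look like in application code (hedged: `exec(sql)` below is a hypothetical stand-in for whatever statement-execution call your driver provides, and 40P01 is PostgreSQL's SQLSTATE for "deadlock detected"):

```lua
-- Hypothetical sketch: run the plain UPDATE and, if it deadlocks, roll back to the
-- savepoint and retry with the ordered SELECT FOR UPDATE variant.
-- `exec(sql)` is assumed to run one statement and raise a Lua error on failure.
local function update_with_deadlock_fallback(exec, plain_sql, ordered_sql)
  exec('SAVEPOINT deadlock_retry')
  local ok, err = pcall(exec, plain_sql)
  if ok then
    exec('RELEASE SAVEPOINT deadlock_retry')
    return
  end
  if not tostring(err):find('deadlock detected') and not tostring(err):find('40P01') then
    error(err) -- some other failure: propagate it
  end
  exec('ROLLBACK TO SAVEPOINT deadlock_retry')
  -- the ordered variant takes its locks in id order, so, among transactions
  -- that do the same, a second deadlock is unlikely
  exec(ordered_sql)
  exec('RELEASE SAVEPOINT deadlock_retry')
end
```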
This appendix is an answer to: "What would you use instead of an ORM?"

Either something like jOOQ, or, if I felt like writing something from scratch, how about a simple raw database library like https://github.com/fcr--/lua-pgdriver. With that in mind, let's start with a simple version:

```lua
function increment_counter(db, id)
  return db:mexec([[
    UPDATE counters SET count = count + 1 WHERE id = $1 RETURNING count
  ]], {{id}})[1][1].count
end

-- The code is just building the query and that's it:
function batch_increment_counters(db, ids)
  local params = {}
  for i = 1, #ids do params[i] = '$'..i end
  local res = {}
  for row in db:mquery([[
    UPDATE counters SET count = count + 1
    WHERE id IN (
      SELECT id FROM counters
      WHERE id IN (]] .. table.concat(params, ',') .. [[)
      ORDER BY id
      FOR UPDATE)
    RETURNING id, count;
  ]], {ids})() do
    res[tonumber(row.id)] = tonumber(row.count)
  end
  return res
end
```

One way to improve this is to have a small template language, which could be as simple as this basic function I wrote in a couple of hours: expand_template.

expand_template takes a template string and a table of options that is used according to the patterns mentioned inside the template:

* `$name`: simple mapping. It will be replaced by a `'$'..number` that is unique within the query: `expand_template 'SELECT $foo, $bar' == 'SELECT $1, $2'`.
* `${name=subtemplate}`: expands the inner subtemplate n times, where n comes from the `name .. '_count'` option and the copies are joined with the `name .. '_separator'` option (default ','). This is useful when you need to expand a variable number of items, as is usually the case in INSERT statements: `expand_template('INSERT INTO foo VALUES ${foos=($id, $name)}', {foos_count=2}) == 'INSERT INTO foo VALUES ($1, $2),($3, $4)'`.
* `${name?subtemplate}`: conditional expansion. If name is true then subtemplate is expanded in its place, otherwise it expands to an empty string: `expand_template('SELECT 1${x?+$n}', {x=true}) == 'SELECT 1+$1'`, while `expand_template('SELECT 1${x?+$n}', {x=false}) == 'SELECT 1'`.
* `$_`: single-argument replacement. This can be used when there's a single argument in a template or subtemplate: `expand_template 'foo $_' == 'foo $1'`.

And this is its source code; note that it's so simple that we didn't even need lpeg for it, just basic top-down recursive pattern matching:

```lua
local function expand_template(template, opts)
  local argcount = 0
  local function rec(template, mapping, mapping_submappers)
    -- we keep a dictionary with the mappings from parameter name to
    -- argument index:
    mapping = mapping or {}
    -- and (for the recursive formatters) a table from the name to an
    -- array of the submappers:
    mapping_submappers = mapping_submappers or {}
    template = template:gsub('%$(%b{})', function(parenscontent)
      -- support for conditional expansion:
      local name, subtemplate = parenscontent:match '{%s*([%a_][%w_]*)%s*%?%s*(.-)%s*}$'
      if name then
        if type(opts[name]) ~= 'boolean' then
          error(('conditional parameter %q must be passed as a boolean '..
            'option to expand_template'):format(name))
        end
        if not opts[name] then return '' end
        return (rec(subtemplate, mapping, mapping_submappers))
      end
      -- and recursive multi-parameter expansion:
      name, subtemplate = parenscontent:match '{%s*([%a_][%w_]*)%s*=%s*(.-)%s*}'
      if name then
        local submappers = {}
        mapping_submappers[name] = submappers
        local buffer = {}
        for i = 1, assert(opts[name .. '_count'], 'missing corresponding _count key') do
          buffer[i], submappers[i] = rec(subtemplate)
        end
        local separator = opts[name .. '_separator'] or ','
        return table.concat(buffer, separator)
      end
      error(('unsupported template format %q'):format(parenscontent))
    end)
    -- now replace the names at the current level:
    template = template:gsub('%$([%a_][%w_]*)', function(name)
      if not mapping[name] then
        argcount = argcount + 1
        mapping[name] = argcount
      end
      return '$' .. mapping[name]
    end)
    return template, function(args, row_arguments)
      row_arguments = row_arguments or {}
      if type(args) ~= 'table' then
        assert(mapping._, 'non table arguments can only be used with $_')
        row_arguments[mapping._] = args
        return row_arguments
      end
      for k, value in pairs(args) do
        if mapping[k] then
          row_arguments[mapping[k]] = value
        elseif mapping_submappers[k] then
          local submappers = mapping_submappers[k]
          for i, innervalue in pairs(value) do
            assert(type(i) == 'number' and i == math.floor(i) and 1 <= i and i <= #submappers)
            submappers[i](innervalue, row_arguments)
          end
        else
          error(("argument %q wasn't mentioned in the template"):format(k))
        end
      end
      return row_arguments
    end
  end
  return rec(template)
end
```

In order to explain how it works, here is an unrelated example where all the supported patterns are used, and where we know at all times the exact SQL being generated:

```lua
function get_employees(opts)
  -- here we generate the query, based on the parameters in opts.
  -- Note that we know exactly what's being generated, and in particular we
  -- have a strong separation between query generation and query execution;
  -- that's why we never pass the actual values when expanding the template.
  local sql, mapper = expand_template([[
    SELECT * FROM employees
    WHERE true
      ${has_name? AND (name LIKE $name)}
      ${has_min_age? AND (
        date_of_birth <= now() - '1 year'::interval * $min_age)}
      ${has_roles? AND role IN (${roles=$_})}
      ${has_limit? ORDER BY id LIMIT $limit}
  ]], {
    has_name = (opts.name ~= nil),
    has_min_age = (opts.min_age ~= nil),
    has_roles = (opts.roles ~= nil),
    roles_count = opts.roles and #opts.roles,
    has_limit = (opts.limit ~= nil),
  })
  -- The mapper is the function that converts the tree of named values,
  -- flattening it into the array of $1..$n arguments. The expansion of the
  -- prepared-statement arguments is done by the PostgreSQL server.
  -- (db is assumed to be a pgdriver connection available in this scope.)
  return db:mexec(sql, {mapper{
    name = opts.name,
    min_age = opts.min_age,
    roles = opts.roles,
    limit = opts.limit,
  }})[1]
end

-- example call:
local employees = get_employees {
  name = '%smith%',
  min_age = 42,
  roles = {'developer', 'sysadmin'},
}
```

To wrap up:

* Don't use an ORM.
* If you're going to do batch updates, the simplest safe way is to always update with the SELECT FOR UPDATE subquery (and don't forget the ORDER BY).
* Doing the SELECT FOR UPDATE does not come for free.
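As a closing sketch, here is how the batch counter update from the beginning of this appendix could look on top of expand_template (hypothetical code: it assumes `db:mexec` returns an array of row tables per statement, as in the examples above, and that ids is non-empty):

```lua
-- Hypothetical rewrite of batch_increment_counters using expand_template.
-- ${ids=$_} expands to $1,$2,...,$n and the mapper flattens the ids array
-- into the corresponding prepared-statement arguments.
local function batch_increment_counters(db, ids)
  local sql, mapper = expand_template([[
    UPDATE counters SET count = count + 1
    WHERE id IN (
      SELECT id FROM counters
      WHERE id IN (${ids=$_})
      ORDER BY id
      FOR UPDATE)
    RETURNING id, count
  ]], {ids_count = #ids})
  local res = {}
  for _, row in ipairs(db:mexec(sql, {mapper{ids = ids}})[1]) do
    res[tonumber(row.id)] = tonumber(row.count)
  end
  return res
end
```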