
TLDR: Always use UTF-8 without BOM.  That's it.

What's a BOM?  It's a Unicode character that, when placed at the beginning of
a text file, can be used to tell UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE
apart.  The BE and LE suffixes indicate the order in which the bytes of each
character are stored (big-endian vs little-endian); for instance, the character
'X' (U+0058) is encoded as follows:

  +----------+---------------+
  | encoding | byte sequence |
  +----------+---------------+
  | UTF-8    | 58            |
  | UTF-16BE | 00 58         |
  | UTF-16LE | 58 00         |
  | UTF-32BE | 00 00 00 58   |
  | UTF-32LE | 58 00 00 00   |
  +----------+---------------+
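
A quick way to double-check these byte sequences, assuming GNU iconv and od
(or compatible ones) are available:

  $ printf 'X' | iconv -f UTF-8 -t UTF-16BE | od -An -t x1
   00 58
  $ printf 'X' | iconv -f UTF-8 -t UTF-32LE | od -An -t x1
   58 00 00 00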

So by starting the file with a BOM (U+FEFF) we can detect the endianness
(whether it is BE or LE), and so avoid confusing this 'X' with '堀' (U+5800):

  +----------+-------------------------+
  | encoding | byte sequence           |
  +----------+-------------------------+
  | UTF-8    | EF BB BF 58             |
  | UTF-16BE | FE FF 00 58             |
  | UTF-16LE | FF FE 58 00             |
  | UTF-32BE | 00 00 FE FF 00 00 00 58 |
  | UTF-32LE | FF FE 00 00 58 00 00 00 |
  +----------+-------------------------+
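
And this is what that BOM'd UTF-8 'X' actually looks like on disk (the \x
escapes assume bash's or GNU coreutils' printf):

  $ printf '\xef\xbb\xbfX' | od -An -t x1
   ef bb bf 58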

Note that in UTF-32 you could likewise mistake '𐌀' (U+00010300) for '𰄀'
(U+00030100) and vice versa, though it's really unlikely this would ever
happen.

Note that there's no concept of endianness in UTF-8: there's no such thing as
UTF-8BE or UTF-8LE.

Let's now start with the arguments in favor of using a BOM.

Ok, since there are no arguments in favor, let's proceed with the arguments
against its usage.

JSON: RFC 7159, section 8.1:

> Implementations MUST NOT add a byte order mark to the beginning of a
> JSON text.  In the interests of interoperability, implementations
> that parse JSON texts MAY ignore the presence of a byte order mark
> rather than treating it as an error.
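
CPython's json module, for instance, goes the other way and treats it as an
error when the text starts with a BOM; a quick check (the exact wording may
vary by version):

  $ printf '\xef\xbb\xbf{}' | python3 -m json.tool
  Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)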

shell "sh" scripts require '#!' for the first two bytes of the file:

* https://www.in-ulm.de/~mascheck/various/shebang/
* https://man7.org/linux/man-pages/man2/execve.2.html
* linux in particular: https://bit.ly/36lWOuk
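
A minimal sketch of what goes wrong (hypothetical file name; bash/GNU printf
\x escapes; the exact error wording depends on the shell that ends up
interpreting the file, and the BOM in the message is invisible):

  $ printf '\xef\xbb\xbf#!/bin/sh\necho hello\n' > bombed.sh
  $ chmod +x bombed.sh
  $ ./bombed.sh
  ./bombed.sh: line 1: #!/bin/sh: command not found
  hello

The kernel no longer sees '#!' at offset 0, so execve(2) fails with ENOEXEC and
the calling shell falls back to interpreting the file itself, tripping over the
invisible BOM on line 1.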

One of the nicest characteristics of UTF-8 is that any text containing only
ASCII characters has the exact same binary representation whether it is encoded
as ASCII or as UTF-8.

By introducing a BOM at the beginning, we are inserting three bytes (EF BB BF)
that are not valid ASCII.
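
A quick way to see it (GNU iconv assumed; the error wording may differ): plain
ASCII text survives a round trip through an ASCII decoder, while the BOM'd
version doesn't:

  $ printf 'hello' | iconv -f ASCII -t UTF-8 | od -An -c
     h   e   l   l   o
  $ printf '\xef\xbb\xbfhello' | iconv -f ASCII -t UTF-8
  iconv: illegal input sequence at position 0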

One nice characteristic of UTF-8 is that two files written in UTF-8 can be
concatenated as if they were binary files.

  $ printf foobar | md5sum
  3858f62230ac3c915f300c664312c63f  -

  $ { printf foo; printf bar; } | md5sum
  3858f62230ac3c915f300c664312c63f  -

Guess what happens if we throw a couple of BOMs into the mix!
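
Here it is (bash/GNU printf \x escapes again): the second BOM ends up glued
into the middle of the text, so the result is no longer the same byte stream:

  $ { printf '\xef\xbb\xbffoo'; printf '\xef\xbb\xbfbar'; } | od -An -c
   357 273 277   f   o   o 357 273 277   b   a   r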

Standard APIs for opening files handle text and binary files equally.
The POSIX standard explicitly says the following for fopen (and no, there's no
such thing as O_BINARY for the low-level open(2)):

> The character 'b' shall have no effect, but is allowed for ISO C standard
> conformance.

This means that BOM handling has to be done by the programs themselves
whenever they need to support those broken text files.  Here are a few
examples, just for loading them:

  * LuaJIT: https://bit.ly/3wcbuXY

>   if (ls->c == 0xef && ls->p + 2 <= ls->pe && (uint8_t)ls->p[0] == 0xbb &&
>       (uint8_t)ls->p[1] == 0xbf) {  /* Skip UTF-8 BOM (if buffered). */
>     ls->p += 2;
>     lex_next(ls);
>     header = 1;
>   }

  * CPython: https://bit.ly/3tYQpxq

>     if (ch1 == EOF) {
>         return 1;
>     } else if (ch1 == 0xEF) {
>         ch2 = get_char(tok);
>         if (ch2 != 0xBB) {
>             unget_char(ch2, tok);
>             unget_char(ch1, tok);
>             return 1;
>         }
>         ch3 = get_char(tok);
>         if (ch3 != 0xBF) {
>             unget_char(ch3, tok);
>             unget_char(ch2, tok);
>             unget_char(ch1, tok);
>             return 1;
>         }
>     } else {
>         unget_char(ch1, tok);
>         return 1;
>     }
>     if (tok->encoding != NULL)
>         PyMem_Free(tok->encoding);
>     tok->encoding = new_string("utf-8", 5, tok);

(yes, as you can see, the encoding is always going to be utf-8 anyway...)

What would happen if your application could not handle BOMs?  Let's see, for
instance, what tac (reverse cat) does with a two-line file that starts with a
BOM.  Assuming "pelota" was created with something like this (bash/GNU printf
\x escapes; hypothetical, but consistent with the od output below):
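
$ printf '\xef\xbb\xbfhello\nworld\n' > pelota

Surprise!  Now your BOM is in the middle: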

$ tac pelota | od -t x1c
0000000  77  6f  72  6c  64  0a  ef  bb  bf  68  65  6c  6c  6f  0a
          w   o   r   l   d  \n 357 273 277   h   e   l   l   o  \n
0000017

Let's grep the lines that start with [a-z] using several tools, just to see
all of them "fail" one by one:
  $ grep '^[a-z]' pelota
  world
  $ awk '/^[a-z]/' pelota
  world
  $ sed '/^[a-z]/!d' pelota
  world
  $ perl -ne 'print if /^[a-z]/' < pelota
  world
  $ luajit -e 'for line in io.lines() do
      if line:find"^[a-z]" then print(line) end
    end' < pelota
  world
  $ python3 -c 'import re, sys; sys.stdout.writelines(
        # this also happens with input():
        line for line in sys.stdin if re.match("[a-z]", line))' < pelota
  world
  $ while read L; do [ -z "${L##[a-z]*}" ] && echo "$L"; done < pelota
  world

Add yours in the comments to show how it also "fails".

If using BOMs in UTF-8 is so problematic, why was it done in the first place?

Because that BOM allows programs to distinguish UTF-8 from legacy encodings
such as latin-1 or cp1251, which don't support any kind of signature; and so a
few desktop apps (like Microsoft Notepad) were written that defaulted to those
legacy encodings whenever the BOM was missing.

Those apps not only did serious damage by generating problematic text files
that broke many tools, but also pushed third-party apps into generating BOMs
just to stay compatible with them (like the "Download as Plain Text" option in
Google Docs does).

In the end even Microsoft, with Visual Studio Code and PowerShell 6+, changed
the default encoding to UTF-8 *without* a BOM, so this UTF-8 BOM nonsense
should be stopped as soon as possible.

So please, don't ever add BOMs to UTF-8 files (unless you do it for trolling,
which can be done by executing ":set bomb" in vim before saving).

In future installments I'll write about other weird "features" present in the
world of Unicode like "Interlinear Annotations" and "Variant Selectors",
because obviously we must make sure no human on earth will ever be able to
"properly" display plain text.