TLDR: Always use UTF-8 without BOM. That's it.

What's a BOM? It's a Unicode character that, when placed at the beginning of a text file, can be used to differentiate between UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. This BE and LE stuff indicates the order in which the bytes of each character get placed; for instance, the character 'X' (U+0058) is encoded as follows:

+----------+---------------+
| encoding | byte sequence |
+----------+---------------+
| UTF-8    | 58            |
| UTF-16BE | 00 58         |
| UTF-16LE | 58 00         |
| UTF-32BE | 00 00 00 58   |
| UTF-32LE | 58 00 00 00   |
+----------+---------------+

So by starting the file with a BOM (U+FEFF) we can detect the endianness (whether it is BE or LE), and thus avoid confusing this 'X' with '堀' (U+5800):

+----------+-------------------------+
| encoding | byte sequence           |
+----------+-------------------------+
| UTF-8    | EF BB BF 58             |
| UTF-16BE | FE FF 00 58             |
| UTF-16LE | FF FE 58 00             |
| UTF-32BE | 00 00 FE FF 00 00 00 58 |
| UTF-32LE | FF FE 00 00 58 00 00 00 |
+----------+-------------------------+

Note that in UTF-32 you can also mistake '𐌀' (U+00010300) for '𰄀' (U+00030100) and vice versa if you guess the endianness wrong, though it's really unlikely this would ever happen.

Note that there's no concept of endianness in UTF-8: there's no such thing as UTF-8BE or UTF-8LE.

Let's now start with the arguments in favor of using a BOM. Ok, since there are no arguments in favor, let's proceed with the arguments against its usage.

JSON: RFC 7159, section 8.1, says:

> Implementations MUST NOT add a byte order mark to the beginning of a
> JSON text.  In the interests of interoperability, implementations
> that parse JSON texts MAY ignore the presence of a byte order mark
> rather than treating it as an error.

Shell: "sh" scripts require '#!' as the first two bytes of the file:

* https://www.in-ulm.de/~mascheck/various/shebang/
* https://man7.org/linux/man-pages/man2/execve.2.html
* linux in particular: https://bit.ly/36lWOuk

One of the nicest characteristics of UTF-8 is that any text containing only ASCII characters has the exact same binary representation whether it is encoded as ASCII or as UTF-8. By introducing a BOM at the beginning, we are inserting three bytes that are invalid in the context of ASCII.

Another nice characteristic of UTF-8 is that two files written in UTF-8 can be concatenated as if they were binary files:

$ printf foobar | md5sum
3858f62230ac3c915f300c664312c63f  -
$ { printf foo; printf bar; } | md5sum
3858f62230ac3c915f300c664312c63f  -

Guess what could happen if we were to add a couple of BOMs into the mixture!
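Here is a quick sketch of exactly that, using two made-up files a and b that each start with their own BOM: after concatenation the second BOM is no longer a "signature" at the start of the stream, it's a stray U+FEFF (a zero-width no-break space) glued into the middle of the text:

$ printf '\357\273\277foo' > a        # "foo" with a leading BOM (EF BB BF, in octal)
$ printf '\357\273\277bar' > b        # "bar" with a leading BOM
$ cat a b | od -t x1c                 # concatenate and dump the bytes
0000000  ef  bb  bf  66  6f  6f  ef  bb  bf  62  61  72
        357 273 277   f   o   o 357 273 277   b   a   r
0000014

The byte stream is also no longer identical to a plain "foobar", so the md5sum trick above breaks as well.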
Standard APIs for opening files handle text and binary files equally. The POSIX standard explicitly says the following for fopen (and no, there's no such thing as O_BINARY for the low-level open(2)):

> The character 'b' shall have no effect, but is allowed for ISO C standard
> conformance.

This means that BOM handling has to be done by the programs themselves whenever they need to support those broken text files. Here are a few examples of what it takes just to load them:

* LuaJIT: https://bit.ly/3wcbuXY

> if (ls->c == 0xef && ls->p + 2 <= ls->pe && (uint8_t)ls->p[0] == 0xbb &&
>     (uint8_t)ls->p[1] == 0xbf) {  /* Skip UTF-8 BOM (if buffered). */
>   ls->p += 2;
>   lex_next(ls);
>   header = 1;
> }

* CPython: https://bit.ly/3tYQpxq

> if (ch1 == EOF) {
>     return 1;
> } else if (ch1 == 0xEF) {
>     ch2 = get_char(tok);
>     if (ch2 != 0xBB) {
>         unget_char(ch2, tok);
>         unget_char(ch1, tok);
>         return 1;
>     }
>     ch3 = get_char(tok);
>     if (ch3 != 0xBF) {
>         unget_char(ch3, tok);
>         unget_char(ch2, tok);
>         unget_char(ch1, tok);
>         return 1;
>     }
> } else {
>     unget_char(ch1, tok);
>     return 1;
> }
> if (tok->encoding != NULL)
>     PyMem_Free(tok->encoding);
> tok->encoding = new_string("utf-8", 5, tok);

(yes, as you can see, the encoding is always going to be utf-8 anyway...)

What would happen if your application could not handle BOMs? Let's run, for instance, tac (reverse cat) on a file with two lines. Surprise! Now your BOM is in the middle:

$ tac pelota | od -t x1c
0000000  77  6f  72  6c  64  0a  ef  bb  bf  68  65  6c  6c  6f  0a
          w   o   r   l   d  \n 357 273 277   h   e   l   l   o  \n
0000017

Let's grep the lines that start with [a-z] using several tools, just to see all of them "fail" one by one (the BOM glues itself to the first line, so "hello" never matches and only "world" comes out):

$ grep '^[a-z]' pelota
world
$ awk '/^[a-z]/' pelota
world
$ sed '/^[a-z]/!d' pelota
world
$ perl -ne 'print if /^[a-z]/' < pelota
world
$ luajit -e 'for line in io.lines() do if line:find"^[a-z]" then print(line) end end' < pelota
world
$ python3 -c '
import re, sys
# this also happens with input():
for line in sys.stdin.readlines():
    if re.match(r"^[a-z]", line):
        print(line, end="")
' < pelota
world
$ while read L; do [ -z "${L##[a-z]*}" ] && echo "$L"; done < pelota
world

Add yours in the comments to show how it also "fails".

If using BOMs in UTF-8 is so problematic, why was it done in the first place? Because the BOM allows programs to distinguish UTF-8 from legacy encodings such as latin-1 or cp-1251, which don't support any kind of signature; and so a few desktop apps (like Microsoft Notepad) were written that defaulted to those legacy encodings when the BOM was missing. Those apps not only did serious damage by generating problematic text files that broke many tools, but also made other third-party apps generate BOMs just to stay compatible with them (like "Download as Plain Text" in Google Docs does). In the end even Microsoft, with Visual Studio Code and PowerShell 6+, changed the default encoding to UTF-8 *without* BOMs, so this UTF-8 BOM nonsense should be stopped as soon as possible.

So please, don't ever add BOMs to UTF-8 files (unless you do it for trolling, which can be done by executing ":set bomb" in vim before saving).

In future installments I'll write about other weird "features" present in the world of Unicode, like "Interlinear Annotations" and "Variant Selectors", because obviously we must make sure no human on earth will ever be able to "properly" display plain text.
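In the meantime, if a stray BOM ever sneaks into one of your own files, getting rid of it is a one-liner. A minimal sketch, assuming GNU coreutils and GNU sed (for -An, -i and the \xHH escapes), with pelota being the BOM-infected file from above:

$ head -c 3 pelota | od -An -t x1     # a leading UTF-8 BOM shows up as...
 ef bb bf
$ sed -i '1s/^\xEF\xBB\xBF//' pelota  # ...and this strips it in place

The vim way is simply the opposite of the trolling above: ":set nobomb" and save.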