mikelabatt 4 days ago
Nice article, thank you. I love UTF-8, but I only advocate it when used with a BOM. Otherwise, an application may have no way of knowing that it is UTF-8, and that it needs to be saved as UTF-8. Imagine selecting New/Text Document in an environment like File Explorer on Windows: if the initial (empty) file has a BOM, any app will know that it is supposed to be saved again as UTF-8 once you start working on it. But with no BOM, there is no such luck, and corruption may be just around the corner, even when the editor tries to auto-detect the encoding (auto-detection is never easy or 100% reliable, even for basic Latin text with "special" characters). The same can happen to a plain ASCII file (without a BOM): once you edit it and add, say, an accented vowel, the chaos begins. You thought it was Italian, but your favorite text editor might conclude it's Vietnamese! I've even seen Notepad switch to a different default encoding after some Windows updates. So, UTF-8 yes, but with a BOM. It should be the default in any app and operating system.
rmunn 4 days ago
The fact that you advocate using a BOM with UTF-8 tells me that you run Windows. Any long-term Unix user has probably seen this error message before (copied and pasted from an issue report I filed just 3 days ago):
If you've got any experience with Linux, you probably suspect the problem already. If your only experience is with Windows, you might not realize the issue. There's an invisible U+FEFF lurking before the `#!`. So instead of that shell script starting with the `#!` character pair that tells the Linux kernel "The application after the `#!` is the application that should parse and run this file", it actually starts with `<FEFF>#!`, which has no meaning to the kernel. The way this script was invoked meant that Bash did end up running the script, with only one error message (because the line did not start with `#` and therefore was not interpreted as a Bash comment) that didn't matter to the actual script logic.

This is one of the more common problems caused by putting a BOM in UTF-8 files, but there are others. The issue is that adding a BOM, as can be seen here, *breaks the promise of UTF-8*: that a UTF-8 file that contains only codepoints at or below U+007F can be processed as-is, and legacy logic that assumes ASCII will parse it correctly. The Linux kernel is perfectly aware of UTF-8, of course, as is Bash. But the kernel logic that looks for `#!`, and the Bash logic that looks for a leading `#` as a comment indicator to ignore the line, do *not* assume a leading U+FEFF can be ignored, nor should they (for many reasons).

What should happen is that these days, every application should assume UTF-8 if it isn't informed of the format of the file, unless and until something happens to make it believe it's a different format (such as reading a UTF-16 BOM in the first two bytes of the file). If a file fails to parse as UTF-8 but there are clues that make another encoding sensible, reparsing it as something else (like Windows-1252) might be sensible. But putting a BOM in UTF-8 causes more problems than it solves, because it *breaks* the fundamental promise of UTF-8: ASCII compatibility with Unicode-unaware logic.
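To make both points concrete, here is a minimal Python sketch. The names `has_shebang` and `decode_text` are hypothetical, and the fallback policy (UTF-16 BOM first, then UTF-8, then Windows-1252) is only one reading of the approach described above, not how any particular kernel, shell, or editor is implemented.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def has_shebang(data: bytes) -> bool:
    """Byte-level check in the spirit of the kernel's: the file must begin with '#!'."""
    return data.startswith(b"#!")


def decode_text(data: bytes) -> str:
    """Assume UTF-8 unless the bytes themselves say otherwise (hypothetical policy)."""
    # A UTF-16 BOM in the first two bytes is a strong hint; the codec consumes it.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to a legacy single-byte encoding.
        return data.decode("cp1252", errors="replace")


script = b"#!/bin/sh\necho hello\n"
print(has_shebang(script))             # True
print(has_shebang(UTF8_BOM + script))  # False: the file now starts with EF BB BF
```

Note that plain `utf-8` decoding keeps a leading BOM as a literal U+FEFF character, which is exactly the invisible character that ends up in front of the `#!`.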
taffer 4 days ago
I respectfully disagree. The BOM is a Windows-specific idiosyncrasy resulting from its early adoption of UTF-16. In the Unix world, a BOM is unexpected and causes problems with many programs, such as GCC, PHP and XML parsers. Don't use it! The correct approach is to use and assume UTF-8 everywhere. 99% of websites use UTF-8. There is no reason to break software by adding a BOM.
Cloudef 4 days ago
A BOM is awful because it breaks concatenation. In the modern world, everything should simply be assumed to be UTF-8 by default.
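A small Python sketch of the concatenation problem, assuming two files that were each saved with a UTF-8 BOM (the `concat` helper is hypothetical and just mimics what `cat a b` does with bytes):

```python
import codecs

BOM = codecs.BOM_UTF8


def concat(*parts: bytes) -> bytes:
    """Byte-wise concatenation, the way `cat a b` would join two files."""
    return b"".join(parts)


a = BOM + "first\n".encode("utf-8")
b = BOM + "second\n".encode("utf-8")

# "utf-8-sig" strips a BOM only at the very start of the data.
joined = concat(a, b).decode("utf-8-sig")
print(repr(joined))  # 'first\n\ufeffsecond\n'
```

The second file's BOM survives as a stray U+FEFF in the middle of the text, where no BOM-aware decoder will remove it.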
cryptonector 4 days ago
You do not need a BOM for UTF-8. Ever. Byte order issues are not a problem for UTF-8 because UTF-8 is manipulated as a string of _bytes_, not as a string of 16-bit or 32-bit code units. You _do_ need a BOM for UTF-16 and UTF-32.
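As a rough illustration in Python (the sample character is arbitrary), the same code point has exactly one UTF-8 byte sequence but two possible UTF-16 sequences, depending on byte order:

```python
text = "\u00e9"  # é

utf8 = text.encode("utf-8")          # one possible byte sequence
utf16_le = text.encode("utf-16-le")  # low byte of the 16-bit unit first
utf16_be = text.encode("utf-16-be")  # high byte of the 16-bit unit first

print(utf8.hex(" "))      # c3 a9
print(utf16_le.hex(" "))  # e9 00
print(utf16_be.hex(" "))  # 00 e9

# Guessing the wrong byte order silently yields a different character.
wrong = utf16_le.decode("utf-16-be")
print(hex(ord(wrong)))    # 0xe900
```

Without a BOM or some external declaration, a reader of the UTF-16 bytes cannot tell which of the two interpretations was intended; the UTF-8 bytes have no such ambiguity.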