Remix.run Logo
rmunn 3 days ago

That's not what I'm saying at all, I'm saying that in the absence of a BOM header a Unicode-aware app should guess UTF-8 first and then guess other likely encodings second, because the chance of false positives on the "is this UTF-8?" guess is practically indistinguishable from zero. If it isn't UTF-8, the UTF-8 parsing attempt is nearly guaranteed to fail, so it's safe to do first.

I'm also saying that apps should not create a BOM header any more (in UTF-8 only, not in UTF-16 where it's required), because the costs of dealing with BOM headers are higher than they're worth. Except in certain specific circumstances, like having to deal with pre-Unicode apps that default to assuming 8-bit encodings.

mikelabatt 2 days ago | parent [-]

Makes sense, thank you. The observation about false positives for UTF-8 tending to zero helps understand. So I will vote for UTF-8 without BOM from now on (while encouraging parsers to deal with it, if present).