| ▲ | mort96 3 hours ago | |||||||||||||
UTF-8 does not encode "European glyphs" in two bytes, no. Most European languages use variations of the latin alphabet, meaning most glyphs in European languages use the 1-byte ASCII subset of UTF-8. The occasional non-ASCII glyph becomes two bytes, that's correct, but that's a much smaller bloat than what you imply. Anyway, what are you comparing it to, what is your preferred alternative? Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information and you can't mix languages in a text file? Or do you prefer using UTF-16, where all of ASCII is 2 bytes per character but you get a marginal benefit for Han texts? | ||||||||||||||
| ▲ | thaumasiotes 3 hours ago | parent [-] | |||||||||||||
> Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information? Yes. Note that this is already how Unicode is supposed to work. See e.g. https://en.wikipedia.org/wiki/Byte_order_mark . A file isn't meaningful unless you know how to interpret it; that will always be true. Assuming that all files must be in a preexisting format defeats the purpose of having file formats. > Most European languages use variations of the latin alphabet If you want to interpret "variations of Latin" really, really loosely, that's true. Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters. This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte. | ||||||||||||||
| ||||||||||||||