gmueckl 2 days ago
This isn't exactly true. Emojis and other symbols introduced new notions like colors that were not present before. I'm no longer certain that it is feasible to handcraft a font that contains all the symbols for codepoints affected by color modifiers.

Also, 8 bit codepages, for all their problems (a different kind of hell), didn't break the assumption that each character is encoded as one byte. In that way, they didn't break software in the interesting ways that UTF-encoded and possibly decomposed Unicode can. Back then, that was something of a blessing at surface level, but the string handling code and concepts that proliferated around this one-to-one mapping just don't fit well with Unicode. And UTF-8 specifically gives the illusion to English speakers that using naive 8 bit string handling works.
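
To make the one-byte-per-character illusion concrete, here is a minimal Python sketch. The helper name `lengths` and the sample word are made up for illustration; only the standard `unicodedata` module is used. It shows that byte count and code point count agree for pure ASCII, then drift apart once an accented letter appears, and that a decomposed string differs from its composed form until it is normalized.

```python
import unicodedata


def lengths(text: str) -> dict:
    """Return two different 'lengths' of the same text (hypothetical helper)."""
    return {
        "code_points": len(text),                 # what Python's len() counts
        "utf8_bytes": len(text.encode("utf-8")),  # what naive byte code counts
    }


ascii_word = "cafe"
composed = unicodedata.normalize("NFC", "caf\u00e9")    # "é" as one code point
decomposed = unicodedata.normalize("NFD", "caf\u00e9")  # "e" + U+0301 combining acute

print(lengths(ascii_word))   # pure ASCII: both counts agree
print(lengths(composed))     # one more byte than code points
print(lengths(decomposed))   # one more code point, one more byte again

# Same text to a reader, but different code point sequences until normalized.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Code that only ever sees English text exercises just the first case, which is probably why the naive approach appears to work.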
WorldMaker a day ago
> Emojis and other symbols introduced new notions like colors that were not present before. I'm no longer certain that it is feasible to handcraft a font that contains all the symbols for codepoints affected by color modifiers.

Color modifiers are just ZWJ sequences. Those existed before. The color modifiers themselves are not the most complicated things that get attached to ZWJ sequences among languages that Unicode supports. OpenType today supports color tables that mean most emoji modified by colors aren't "handcrafted" but algorithmically constructed. (As many ligatures and other ZWJ sequences often are.)

> Also, 8 bit codepages, for all their problems (a different kind of hell), didn't break the assumption that each character is encoded as one byte.

That was broken in 8-bit codepages as well; it was just seen as an exception/edge case rather than the rule. The big obvious exception has always been \r\n (carriage return then newline), but there are also ^H (control-H) and ^W (control-W) sequences (effectively backspace and delete word), and the entire gamut of things done with ANSI and/or VT100 escape sequences starting with Escape, often stylized as ^[.

> And UTF-8 specifically gives the illusion to English speakers that using naive 8 bit string handling works.

Unless emoji are present, which is one of the great things about emoji, and about emoji becoming a very common form of punctuation in English text. Naive 8-bit string handling was always wrong. Emoji help make it visible how wrong it was. (In part by doing things other languages do, such as ZWJ sequences, having code points out in the Astral Plane, and other such features.)
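
As a rough illustration of how emoji expose naive string handling, the Python sketch below builds one ZWJ sequence (man + ZWJ + woman + ZWJ + girl, which many fonts render as a single family glyph) and counts it three ways. The helpers `utf16_units` and `cut_bytes` are hypothetical names invented for this example, and how the sequence is drawn depends on the font and renderer.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Man + ZWJ + woman + ZWJ + girl. All three emoji live outside the BMP
# (the "Astral Plane"), so each needs 4 UTF-8 bytes or 2 UTF-16 code units.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"


def utf16_units(text: str) -> int:
    """Count UTF-16 code units (hypothetical helper)."""
    return len(text.encode("utf-16-le")) // 2


def cut_bytes(text: str, n: int) -> str:
    """Naively keep the first n UTF-8 bytes, as byte-oriented code would."""
    return text.encode("utf-8")[:n].decode("utf-8", errors="replace")


print(len(family))                  # code points in the sequence
print(len(family.encode("utf-8")))  # UTF-8 bytes
print(utf16_units(family))          # UTF-16 code units

print(cut_bytes("hello", 3))        # byte slicing happens to work on ASCII
print(repr(cut_bytes(family, 5)))   # a cut inside the ZWJ leaves U+FFFD
```

None of the three counts equals the "one character" a reader sees. Counting user-perceived characters needs grapheme cluster segmentation, which Python's standard library does not provide.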
plorkyeran a day ago
One byte equals one character was already incorrect in the pre-Unicode days for East Asian languages. UTF-8 is much easier to parse than something like Shift JIS, where splitting a string in between the bytes of a codepoint results in a valid but incorrect string.
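
The Shift JIS hazard can be sketched in a few lines of Python using the standard `shift_jis` codec. The helper `try_decode` and the sample string are made up for illustration. The katakana letter ア encodes as the bytes 0x83 0x41 in Shift JIS, and 0x41 on its own is ASCII "A", so a slice that starts one byte too late still decodes without complaint.

```python
from typing import Optional


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Decode strictly; return None instead of raising (hypothetical helper)."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


text = "\u30a2\u30a2"  # "アア": KATAKANA LETTER A, twice

# Drop the first byte, as a slice at the wrong offset would.
sjis_tail = text.encode("shift_jis")[1:]  # b"A\x83A"
utf8_tail = text.encode("utf-8")[1:]      # begins with a continuation byte

# Shift JIS: the orphaned trail byte 0x41 is read as ASCII "A", so the
# result is a well-formed string that was never in the original text.
print(try_decode(sjis_tail, "shift_jis"))

# UTF-8: continuation bytes can never start a character, so a strict
# decoder rejects the slice and the mistake is detectable.
print(try_decode(utf8_tail, "utf-8"))
```

Because UTF-8 lead and continuation bytes occupy disjoint ranges, a decoder can also skip forward to the next lead byte to resynchronize. Shift JIS trail bytes overlap both ASCII and the lead-byte range, so no such local check exists.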