WorldMaker 2 days ago

Rendering Unicode was always this complex. Emoji don't do anything that some other language in real use doesn't also do. What emoji do is bring that complexity visibly to the forefront in contemporary English text. The assumption that 8-bit character sets of simple bitmaps are all you need mostly only ever worked for English (and then only if you didn't need nice print-like typography, or math formulas, or…).

gmueckl 2 days ago | parent [-]

This isn't exactly true. Emojis and other symbols introduced new notions like colors that were not present before. I'm no longer certain that it is feasible to handcraft a font that contains all the symbols for codepoints affected by color modifiers.

Also, 8-bit codepages, for all their problems (a different kind of hell), didn't break the assumption that each character is encoded as one byte. In that way, they didn't break software in interesting ways the way UTF-encoded and possibly decomposed Unicode can. Back then, it was something of a blessing at surface level, but the string handling code and concepts that proliferated under this one-to-one mapping assumption just don't fit well with Unicode. And UTF-8 specifically gives English speakers the illusion that naive 8-bit string handling works.
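To make the "possibly decomposed" point concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `same_text` is made up for illustration, and comparing after NFC normalization is only one reasonable choice, not the only correct one.

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9).
composed = "\u00e9"
# The same visible character as "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", composed)

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)             # naive comparison of code points
print(len(composed), len(decomposed))     # code point counts differ
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # so do byte counts
print(same_text(composed, decomposed))    # equal once normalized
```

Code that assumes one unit of storage per character gets three different answers here (bytes, code points, and what the reader sees), which is the kind of breakage described above.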

WorldMaker a day ago | parent | next [-]

> Emojis and other symbols introduced new notions like colors that were not present before. I'm no longer certain that it is feasible to handcraft a font that contains all the symbols for codepoints affected by color modifiers.

Color modifiers are just ZWJ sequences. Those existed before. The color modifiers themselves are not the most complicated things that get attached to ZWJ sequences among languages that Unicode supports.

OpenType today supports color tables that mean most emoji modified by colors aren't "handcrafted" but algorithmically constructed. (As many ligatures and other ZWJ sequences often are.)

> Also, 8 bit codepages, for all their problems (a different kind of hell), didn't break the assumption that each character is encoded as one byte.

That is broken in other 8-bit codepages as well; it was just seen as an exception/edge case rather than the rule. The big obvious exception has always been \r\n (carriage return then newline), but there's also ^H (control-H) and ^W (control-W) sequences (effectively backspace and delete word), and the entire gamut of things done with ANSI and/or VT100 escape sequences starting with Escape, often stylized as ^[.

> And UTF-8 specifically gives the illusion to English speakers that using naive 8 bit string handling works.

Unless emoji are present, which is one of the great things about emoji and emoji becoming a very common form of punctuation in English text. Naive 8-bit string handling was always wrong. Emoji help make it visible how wrong it was. (In part by doing things other languages do such as ZWJ sequences and having code points out in the Astral Plane and other such features.)
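A small Python sketch of how emoji expose naive byte handling. The function names `naive_truncate` and `truncate_at_code_point` are made up for illustration, and the sample string (a thumbs-up followed by a skin tone modifier) is just an assumed example.

```python
# "ok " followed by THUMBS UP SIGN (U+1F44D) and a skin tone modifier (U+1F3FD).
text = "ok \U0001F44D\U0001F3FD"
utf8 = text.encode("utf-8")

def naive_truncate(data: bytes, limit: int) -> bytes:
    """Cut at a byte count, as 8-bit string code would."""
    return data[:limit]

def truncate_at_code_point(data: bytes, limit: int) -> bytes:
    """Back up from `limit` until the cut does not land on a UTF-8 continuation byte."""
    if limit >= len(data):
        return data
    while limit > 0 and (data[limit] & 0xC0) == 0x80:
        limit -= 1
    return data[:limit]

print(len(text), len(utf8))  # code points vs. bytes

try:
    naive_truncate(utf8, 5).decode("utf-8")
except UnicodeDecodeError as err:
    print("naive cut produced invalid UTF-8:", err.reason)

# Valid UTF-8 again, but the modifier has been separated from the emoji it applied to.
print(truncate_at_code_point(utf8, 8).decode("utf-8"))
```

Even the code-point-aware cut is not enough for text as a reader sees it: it keeps the bytes valid but can still split a modified emoji or a ZWJ sequence, which needs grapheme cluster segmentation that the Python standard library does not provide.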

gmueckl a day ago | parent [-]

So you agree that font rendering had to be extended to support color modifiers as specified in Unicode? That is the kind of complexity creep that I am pointing out.

A bunch of control codes are historically part of character encodings, and their encoding is very consistent within codepages of the same family (ASCII/ANSI and EBCDIC). You don't have to have any awareness about the active codepage/language to handle them correctly.

Terminal escape sequences are a poor form of in-band signaling between devices (now virtualized), not text. I consider that out of scope.

Anyway, as we get into the weeds here, I do not want to dispute the enormous practical utility of Unicode and I am glad that it exists and covers so many of the world's writing systems and alphabets. It is one of the central standards that connects people today. But from the purely technical perspective, the steady complexity creep is undeniable and brings somewhat hidden costs to software systems.

WorldMaker 15 hours ago | parent [-]

Font rendering has always had to adapt to colors. There were (non-standard) multi-color fonts in the 8-bit era, too. It's not just emoji that are using OpenType color tables and OpenType color tables are not the only way that colors in emoji have been handled, just one of the most standardized/algorithmic methods. Part of why they were standardized is how much they resemble other tables needed for complex ligatures in other languages that aren't emoji. Emoji didn't invent complex ligatures with multi-table lookup to render. Emoji benefit from simple extensions of the exact same mechanic. It's a new table, yes, but it is not a complex table, it's one of the simplest tables in OpenType.

plorkyeran a day ago | parent | prev [-]

One byte equals one character was already incorrect in the pre-Unicode days for East Asian languages. UTF-8 is much easier to parse than something like Shift JIS, where splitting a string in between the bytes of a codepoint results in a valid but incorrect string.
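A rough Python illustration of that difference, using the standard `shift_jis` and `utf-8` codecs. The helper `decode_tail` is a made-up name, and the sample text (katakana "ア" twice) is chosen because its Shift JIS trail byte happens to be the ASCII letter "A".

```python
from typing import Optional

def decode_tail(text: str, encoding: str, offset: int) -> Optional[str]:
    """Hypothetical helper: encode `text`, drop the first `offset` bytes,
    and try to decode the rest. Returns None if the tail is not valid."""
    data = text.encode(encoding)
    try:
        return data[offset:].decode(encoding)
    except UnicodeDecodeError:
        return None

sample = "\u30a2\u30a2"  # KATAKANA LETTER A, twice

# Shift JIS: each character is the bytes 0x83 0x41, and 0x41 alone is ASCII "A".
# Cutting after the first byte leaves a tail that decodes without error but is wrong.
sjis_tail = decode_tail(sample, "shift_jis", 1)

# UTF-8: the tail starts with a continuation byte, so the decoder rejects it
# and the damage is detectable.
utf8_tail = decode_tail(sample, "utf-8", 1)

print(repr(sjis_tail))
print(repr(utf8_tail))
```

The Shift JIS tail silently turns half a katakana character into a Latin letter, while the UTF-8 tail fails loudly, because UTF-8 lead and continuation bytes occupy disjoint ranges.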