numpad0 16 hours ago

IMO, the sin of Unicode is that they didn't just pick local language authorities and give them standardized concepts like lines and characters, along with start-of-language and end-of-language markers.

Lots of Unicode issues come from handling languages that the code is not expecting, and code currently has no means to select or report quirk support.

I suppose they didn't like getting national borders involved in technical standardization, but that's just unavoidable. They're getting involved anyway, and these problems are popping up anyway.

kmeisthax 16 hours ago | parent | next [-]

This doesn't self-synchronize. Removing an arbitrary byte from the text stream (e.g. a start-of-language or end-of-language marker) will change the meaning of codepoints far away from the site of the corruption.
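A minimal sketch of what self-synchronization buys (the helper name is mine, not from any library): in UTF-8, continuation bytes are recognizable in isolation, so corruption stays local instead of reinterpreting everything downstream of a lost marker:

    def resync_utf8(data: bytes, pos: int) -> int:
        """Advance to the next code point boundary at or after pos.

        UTF-8 continuation bytes always match 0b10xxxxxx and lead
        bytes never do, so boundaries are recognizable locally --
        no state from earlier in the stream is needed.
        """
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1
        return pos

    text = "héllo".encode("utf-8")     # b'h\xc3\xa9llo'
    corrupted = text[:1] + text[2:]    # drop one byte mid-sequence
    # Damage stays local: exactly one code point is lost.
    print(corrupted.decode("utf-8", errors="replace"))  # h�llo

With stateful language markers, the dropped byte could be the marker itself, silently changing the interpretation of all text up to the next marker.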

What it sounds like you want is an easy way for English-language programmers to skip or strip non-ASCII text without having to reference any actual Unicode documentation. Which is a Unicode non-goal, obviously. And also very bad software engineering practice.

I'm also not sure what you're getting at with national borders and language authorities, but both of those were absolutely involved with Unicode and still are.

kevin_thibedeau 16 hours ago | parent | prev | next [-]

> start-of-language and end-of-language markers

Unicode used to have language tags, but they've been (mostly) deprecated:

https://en.wikipedia.org/wiki/Tags_(Unicode_block)

https://www.unicode.org/reports/tr7/tr7-1.html
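For reference, the deprecated mechanism shifted the ASCII spelling of a language code into a Plane 14 "tag character" range. A rough sketch (the helper name is mine):

    def language_tag(lang: str) -> str:
        """Build a (deprecated) Plane 14 language tag, e.g. for "ja".

        U+E0001 LANGUAGE TAG introduces the tag; each character of
        the language code is shifted into the tag-character range
        U+E0020..U+E007E by adding 0xE0000. U+E007F CANCEL TAG
        cancels tagging.
        """
        return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang)

    tagged = language_tag("ja") + "\u76F4" + "\U000E007F"
    # Renderers were never required to honor these, and in practice
    # almost none did -- hence the deprecation.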

anonymoushn 16 hours ago | parent [-]

This goes back to Han unification: the lack of such markers prevents Unicode from encoding strings of mixed Japanese and Chinese text correctly, because the same code points are expected to render with different glyphs depending on the language. So for a piece of software that must accept both Chinese and Japanese names for different people, Unicode alone is insufficient for encoding the written forms of the names.
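A small illustration (invented data, assuming a data model that carries a language field per string):

    # U+76F4 (直) is one unified code point, but Japanese and Chinese
    # typography expect different glyph forms for it. The code point
    # alone cannot record which form a name's bearer uses.
    name_ja = ("\u76F4\u5B50", "ja")   # e.g. Naoko, needs a Japanese font
    name_zh = ("\u76F4\u5B50", "zh")   # same code points, Chinese reading

    # The encoded strings are indistinguishable:
    assert name_ja[0] == name_zh[0]
    # so correct rendering depends on out-of-band metadata (here a
    # tuple field; in HTML a lang attribute) that the encoded text
    # itself doesn't carry.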

layer8 15 hours ago | parent | prev [-]

I’m working with Word documents in different languages, and few people take care to properly tag every piece of text with the correct language. What you’re proposing wouldn’t work very well in practice.

The other historical background is that when Unicode was designed, many national character sets and encodings existed, and Unicode’s purpose was to serve as a common superset of those, as otherwise you’d need markers when switching between encodings. So the existing encodings needed to be easily convertible to Unicode (and back), without markers, for Unicode to have any chance of being adopted. This was the value proposition of Unicode: to eliminate, as much as possible, the need to special-case national character sets. As a sibling comment notes, there originally were also optional language markers, which however nobody used.
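That convertibility requirement is easy to see with legacy codecs Python still ships (a minimal round-trip sketch using the shift_jis codec):

    # Lossless round-trip between a national encoding and Unicode,
    # with no markers needed at the boundary -- the convertibility
    # that made adoption feasible:
    legacy = "日本語".encode("shift_jis")
    via_unicode = legacy.decode("shift_jis")
    assert via_unicode.encode("shift_jis") == legacy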