lifthrasiir 3 days ago

Eh, Han unification was a one-off decision. Since then many (but not all) characters have been disunified as needed, like the infamous Biang character [1], which received two different code points. Of course common characters are much less likely to be disunified, because many decades have passed since the initial encoding and any disunification would cause compatibility issues.

[1] https://en.wikipedia.org/wiki/Biangbiang_noodles#Unicode

numpad0 3 days ago | parent [-]

It's an upheld decision. The unification is not about reducing character counts overall, but about co-mingling CJKV languages. Adding more characters does not un-mingle existing characters.

One thing I feared might happen, and which does seem to be happening: Chinese LLMs and AI projects seem to be moving towards Chinese-English bilingual models and away from regular omni-lingual models. I think that is because LLMs would get confused by syntax and dictionary definitions that are invalid in Chinese, and/or generally perform worse, if substantial non-Chinese CJKV data were included in the dataset.

At the polar opposite of computing, Hollow Knight: Silksong, released just days prior, is having a Han unification font problem as well. As you might know, thanks to Han unification, CJKV languages each require their own font, no two of which can be active at the same time, and characters become mangled unless the application developer spends substantial cost implementing such a non-standard feature. The developers were not aware of that, did not spend the extra cost, and are getting review bombed in China.

It just needs to be reversed. It's a real problem. Adding more obscure characters and obscure features is tangential and not a solution. Different, isolated clusters of character usage need to be separated, not overlapped into one and the same area, like there are no "GermanFrench-English dictionary".

lifthrasiir 3 days ago | parent | next [-]

The unification, implemented in Unicode 1.1, is definitely a character count reduction mechanism. I'm very sure that if the decision to abandon the 16-bit character set had been made earlier, the unification wouldn't have happened.

And I'm saying this as a CJKV person and past gamedev: CJKV languages each require their own font whether or not Han unification is implemented. There are simply too many glyphs; not just unified characters, but also common characters that are not considered unified are also often varying across countries. If you try to account for all those glyph variations in a single font, you just can't cope, because OpenType only supports at most 65,536 glyphs in a single typeface. In an alternative universe OpenType may have been extended to allow more glyphs in a single typeface, I don't know, but CJKV characters are simply numerous enough to require multiple font files in general. Han unification is of less concern when you have that many glyphs.

numpad0 3 days ago | parent [-]

> not just unified characters, but also common characters that are not considered unified are also often varying across countries.

That's the unification: the issues stemming from CJKVs each not having their own code points. The issue is not that CJKVs need multiple font files and that it's cumbersome; the issue is that no two CJKV fonts may be loaded at the same time because there are conflicting glyphs. Conflicting glyphs. That's just wrong.

lifthrasiir 3 days ago | parent | next [-]

If you somehow want to display, say, both Japanese and Chinese text at the same time, there is no technical obstacle that prevents you from doing so. Pan-Unicode fonts come with differently named files for CJKV characters, so that is not even difficult. Yes, your assets will include multiple multi-megabyte font files. Is that a problem for modern games? I don't think so.

There is a single circumstance where this is not generally doable: a user name in a globally serviced online game. (Guess why I know of this case...) Unless there is a hint that a particular user prefers their user name to be displayed in a certain way, it is difficult to decide which font to use (or even which set of fonts to use). But it's a very niche problem; otherwise you know the language of the text you are showing and can pick the correct font from your assets.
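A minimal sketch of the "pick the correct font from your assets" step, assuming the text comes with a BCP 47 style language tag. The font paths, the `FONT_BY_LANGUAGE` table, and the `pick_font` helper are all hypothetical names for illustration, and the default font is an arbitrary choice, which is exactly the user name problem described above.

```python
# Hypothetical asset paths; substitute whatever font files you actually ship.
FONT_BY_LANGUAGE = {
    "ja": "fonts/cjk-jp.otf",
    "ko": "fonts/cjk-kr.otf",
    "zh-Hans": "fonts/cjk-sc.otf",
    "zh-Hant": "fonts/cjk-tc.otf",
}

# Arbitrary fallback used when no language hint is available.
DEFAULT_FONT = "fonts/cjk-jp.otf"


def pick_font(language_tag, default=DEFAULT_FONT):
    """Return the font path for a language tag such as "zh-Hans-CN".

    Subtags are dropped from the right until a known tag is found,
    so "ja-JP" falls back to "ja". Matching is case-sensitive here
    to keep the sketch short.
    """
    if not language_tag:
        return default
    parts = language_tag.split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in FONT_BY_LANGUAGE:
            return FONT_BY_LANGUAGE[candidate]
        parts.pop()  # drop the last subtag and retry
    return default


if __name__ == "__main__":
    for tag in ["ja-JP", "zh-Hans-CN", "zh-Hant", None]:
        print(tag, "->", pick_font(tag))
```

For a user name with no language hint, `pick_font(None)` simply returns the default, so the text still renders but possibly with the wrong regional glyph shapes.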

numpad0 2 days ago | parent [-]

What you've said is correct, but it also means Unicode strings containing CJKV characters become mildly corrupt if decoded without a "--interpret-as=<language>" option to change binary-glyph correspondence. That's just not what Unicode should stand for.

You should not need to keep or infer the language hint. I know it was always the officially sanctioned way and what developers engaged in i18n work have to live with. My point is NOT that you are wrong, but that this part of the Unicode spec is wrong.

zahlman 2 days ago | parent | prev [-]

> Conflicting glyphs.

Which could be chosen between using variation selectors.
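For illustration, a variation sequence is just a base character followed by a variation selector code point, so it can be inspected with plain string handling. This is a sketch only: the helper names are made up, it handles only the U+FE00..U+FE0F and U+E0100..U+E01EF selector ranges, and it does not check whether a sequence is registered in the Ideographic Variation Database or supported by any particular font.

```python
def is_variation_selector(ch):
    """True for VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def split_variation_sequences(text):
    """Return a list of (base, selector_or_None) pairs.

    A selector is attached to the character immediately before it.
    A stray selector with nothing to attach to is kept as its own entry.
    """
    pairs = []
    for ch in text:
        if is_variation_selector(ch) and pairs and pairs[-1][1] is None:
            pairs[-1] = (pairs[-1][0], ch)
        else:
            pairs.append((ch, None))
    return pairs


def strip_variation_selectors(text):
    """Drop all selectors, leaving only the unified base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


if __name__ == "__main__":
    base = "\u845B"                # a unified ideograph
    sample = base + "\U000E0100"   # the same ideograph followed by VS17
    for b, vs in split_variation_sequences(sample):
        selector = f"U+{ord(vs):04X}" if vs else "none"
        print(f"base U+{ord(b):04X}, selector {selector}")
```

Stripping the selectors gives back the plain unified text, which is why software that ignores them still shows a readable, if regionally unspecific, glyph.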

numpad0 2 days ago | parent [-]

I guess, but I've never heard of a `cat text | ivs-convert --from=utf8 --to=zh-Hans` type of thing. So in practice it's almost non-existent.

eviks 3 days ago | parent | prev [-]

> like there are no "GermanFrench-English dictionary"

But there is a single Latin alphabet

numpad0 3 days ago | parent | next [-]

Except there are standalone Greek/Coptic as well as Cyrillic ranges in Unicode. Latin A, Greek A, and Russian A each have their own code points in Unicode, so that Latin or Russian fonts don't have to be deleted from apps and operating systems configured for Greek usage in order to show a Greek :alpha: consistent with Greek :phi:, without it getting substituted by lowercase Latin `a`.
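A quick way to see the separate code points, using only Python's standard `unicodedata` module (the `letters` table and `describe` helper are just names for this sketch):

```python
import unicodedata

# Three look-alike capital letters that Unicode encodes separately.
letters = {
    "Latin": "\u0041",
    "Greek": "\u0391",
    "Cyrillic": "\u0410",
}


def describe(ch):
    """Format a character as its code point plus official Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for script, ch in letters.items():
        print(f"{script:9}{describe(ch)}")
```

Unified Han ideographs, by contrast, have one code point regardless of whether the text is Chinese, Japanese, Korean, or Vietnamese.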

eviks 3 days ago | parent [-]

> standalone Greek/Coptic

So? Unicode isn't at either extreme: it didn't unify Latin and Greek (using language tags to differentiate), but it also didn't separate German and French, so your GermanFrench dictionary still falls flat. It doesn't help in picking the dividing line.

account42 3 days ago | parent | prev [-]

Unifying I and I was also a mistake although that one at least preceded Unicode.