senkora a day ago

Hangul is great for computer entry, but the data representation is a little tricky, because each syllable is treated as a single glyph and there are many syllables.

I found this old comment that explains it better than I can: https://news.ycombinator.com/item?id=28287811

lgessler a day ago | parent | next [-]

This has led to work showing that models can do better sometimes if you decompose these into their constituent characters, e.g.: https://aclanthology.org/2022.emnlp-main.472.pdf

bobthepanda 21 hours ago | parent [-]

A paper on Korean where the main acronym is BTS has got to be intentional, right?

lgessler 17 hours ago | parent [-]

> We hope that our BTS will light the way up like dynamite[9] for future research on Korean NLP.

kijin a day ago | parent | prev | next [-]

The data representation is fairly straightforward once you're familiar with the composition rules, at least for modern Korean.

Unicode simply lists all possible combinations in dictionary order starting from U+AC00. So you can take any code point in that block and split out the 초성, 중성 and 종성 using simple arithmetic, just like you can figure out Latin letters from their ASCII codes.
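For anyone who wants to see that arithmetic spelled out, here is a minimal Python sketch. The constants are the ones used for the Hangul Syllables block (19 initials × 21 medials × 28 finals, where final index 0 means "no final"). The function names `decompose` and `compose` are just illustrative, not from any library.

```python
# Sketch: split a precomposed Hangul syllable into conjoining jamo and back.
S_BASE = 0xAC00                                   # first syllable, 가
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7   # initial, medial, final bases
V_COUNT, T_COUNT = 21, 28                         # medials, finals (incl. "none")
S_COUNT = 19 * V_COUNT * T_COUNT                  # 11172 syllables in the block


def decompose(syllable):
    """Return (initial, medial, final) jamo; final is "" when absent."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l, rest = divmod(index, V_COUNT * T_COUNT)    # which initial consonant
    v, t = divmod(rest, T_COUNT)                  # which vowel, which final
    final = chr(T_BASE + t) if t else ""
    return chr(L_BASE + l), chr(V_BASE + v), final


def compose(initial, medial, final=""):
    """Inverse of decompose: rebuild the single precomposed code point."""
    l = ord(initial) - L_BASE
    v = ord(medial) - V_BASE
    t = ord(final) - T_BASE if final else 0
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)
```

For example, `decompose("감")` gives the three code points U+1100, U+1161 and U+11B7, and `compose(*decompose("한"))` gives back `"한"`.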

hyeonwho4 20 hours ago | parent [-]

초성 = initial sound (consonant)
중성 = middle sound (vowel)
종성 = final sound (consonant)

My understanding is that there are two possible Unicode representations of Korean: one stores each sound separately (what macOS uses) and the other stores each syllable as a single code point (what Windows uses). This is why Korean UTF-8 filenames from macOS appear broken on modern Windows machines.

kijin 14 hours ago | parent [-]

Yeah, it's stupid that Windows can't normalize the two completely valid ways of expressing Hangul in Unicode. If they can process e + acute accent = é, they should be able to do ㄱ + ㅏ = 가.

Having said that, macOS also made the strange choice of expressing Hangul using the Hangul Jamo (by sound) Unicode block even when there are equivalent precomposed symbols in the Hangul Syllables block. Encoding each sound individually takes up 2-3 times more storage, just like with accented characters in Latin. Besides, if you just list sounds and rely on them to be combined automatically, what do you do when you legitimately want to write a sequence of uncombined sounds, like ㄱㅏㅁ instead of 감?
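To make the two forms concrete, here is a rough Python sketch using the standard `unicodedata` module. It assumes the "by sound" form is NFD and the precomposed form is NFC, which is how I read the comments above. `same_text` is a made-up helper name, and the example also shows one way the uncombined case can be written: the Hangul Compatibility Jamo code points (U+3131 and friends) are left alone by NFC.

```python
import unicodedata

composed = "\uac10"                                   # 감 as one code point
decomposed = unicodedata.normalize("NFD", composed)   # U+1100 U+1161 U+11B7


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# UTF-8 size: the precomposed syllable is 3 bytes, the three jamo are 3 bytes each.
utf8_sizes = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Compatibility jamo ㄱㅏㅁ: NFC does not merge these into a syllable.
compat = "\u3131\u314f\u3141"
compat_after_nfc = unicodedata.normalize("NFC", compat)
```

Here `composed != decomposed` as raw strings, but `same_text(composed, decomposed)` is true, which is roughly what a filename comparison would need to do to treat both forms as the same name.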

samatman 19 hours ago | parent | prev [-]

Rare WalterBright L taken in that thread.

Sure, Unicode isn't the Platonic ideal of a character encoding. It has warts, legacy features, and... and it is a universal encoding of all human writing. What an exceptional and incredible accomplishment.

Could you replace it with something better designed?

No. No, you cannot. You can in principle design something better, but that's a completely different, quixotic, and useless task.

It's also far from impossible to implement Unicode 'correctly': folks not only can, but do, routinely. It's extensively documented, there's example code, and the rest is just work.

Also, if someone's game plan for Unicode-D includes removing the most beloved and consistently demanded feature, emoji, then no, that person in particular is not capable even in principle of designing something better. That game has been lost before it began.