senkora a day ago
Hangul is great for computer entry, but the data representation is a little tricky, because each syllable is treated as a single glyph and there are many syllables. I found this old comment that explains it better than I can: https://news.ycombinator.com/item?id=28287811
lgessler a day ago
This has led to work showing that models can sometimes do better if you decompose these syllables into their constituent characters, e.g.: https://aclanthology.org/2022.emnlp-main.472.pdf
kijin a day ago
The data representation is fairly straightforward once you're familiar with the composition rules, at least for modern Korean. Unicode simply lists all possible combinations in dictionary order starting from U+AC00. So you can take any code point and split out the 초성 (initial consonant), 중성 (vowel) and 종성 (final consonant) using simple arithmetic, just like you can figure out Latin letters from their ASCII codes.
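For anyone who wants to see that arithmetic spelled out, here is a minimal Python sketch. It assumes the standard layout of the precomposed block (19 initials, 21 vowels, 28 finals counting "no final", starting at U+AC00); the function names `decompose_syllable` and `to_jamo` are made up for illustration, not part of any library.

```python
import unicodedata

S_BASE = 0xAC00                      # first precomposed syllable, "가"
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7  # conjoining jamo bases
V_COUNT, T_COUNT = 21, 28            # vowels; finals (index 0 = no final)
N_COUNT = V_COUNT * T_COUNT          # 588 syllables per initial consonant
S_COUNT = 19 * N_COUNT               # 11172 precomposed syllables in total


def decompose_syllable(ch):
    """Return (initial, vowel, final) indices for one precomposed syllable."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    lead = s_index // N_COUNT
    vowel = (s_index % N_COUNT) // T_COUNT
    tail = s_index % T_COUNT         # 0 means the syllable has no final
    return lead, vowel, tail


def to_jamo(ch):
    """Map the indices onto conjoining jamo code points."""
    lead, vowel, tail = decompose_syllable(ch)
    jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if tail:
        jamo += chr(T_BASE + tail)
    return jamo


print(decompose_syllable("한"))  # (18, 0, 4)
# Cross-check against the canonical decomposition from the standard library.
assert to_jamo("한") == unicodedata.normalize("NFD", "한")
```

The indices are positions in dictionary order, so 18 is the last initial (ㅎ) and 0 is the first vowel (ㅏ). This only covers the modern precomposed block; archaic syllables built from conjoining jamo are outside its scope.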
samatman 19 hours ago
Rare WalterBright L taken in that thread. Sure, Unicode isn't the Platonic ideal of a character encoding. It has warts, legacy features, and... and it is a universal encoding of all human writing. What an exceptional and incredible accomplishment. Could you replace it with something better designed? No. No, you cannot. You can in principle design something better, but that's a completely different, quixotic, and useless task. It's also far from impossible to implement Unicode 'correctly'; folks not only can, but do, routinely. It's extensively documented, there's example code, it's just work. Also, if your game plan for Unicode-D includes removing the most beloved and consistently demanded feature, emoji, then no, that person in particular is not capable even in principle of designing something better. That game has been lost before it began.