linguae 4 days ago
I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences). UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!
layer8 4 days ago
On the other hand, you now have to deal with the issues of Han unification: https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...
pezezin 4 days ago
I live in Japan and I still receive the random email or work document encoded in Shit-JIS. Mojibake is not as common as it once was, but still a problem.
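For anyone who still has to read such files, here is a minimal Python 3 sketch of trial decoding. The helper name `decode_japanese` and the candidate order are assumptions made for illustration, not an established API, and strict trial decoding is only a heuristic: some byte sequences are valid in more than one encoding, so the first match can be wrong.

```python
def decode_japanese(raw, candidates=("utf-8", "euc_jp", "shift_jis")):
    """Hypothetical helper: return (text, codec) for the first codec
    that decodes `raw` without errors. The order is a guess, not a rule."""
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue  # not valid in this codec, try the next one
    raise ValueError("none of the candidate encodings matched")


text = "日本語"

# The same string produces different bytes under each legacy encoding.
sjis_bytes = text.encode("shift_jis")
euc_bytes = text.encode("euc_jp")

# Mojibake: Shift-JIS bytes read as if they were EUC-JP.
# errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
garbled = sjis_bytes.decode("euc_jp", errors="replace")

decoded, codec = decode_japanese(sjis_bytes)
```

When the sender declares a charset (for example in an email header), trusting that declaration is usually safer than guessing from the bytes.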