Remix.run Logo
linguae 4 days ago

I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).

UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!

layer8 4 days ago | parent | next [-]

On the other hand, you now have to deal with the issues of Han unification: https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...

4 days ago | parent | prev | next [-]
[deleted]
pezezin 4 days ago | parent | prev [-]

I live in Japan and I still receive the random email or work document encoded in Shit-JIS. Mojibake is not as common as it once was, but still a problem.

rmunn 4 days ago | parent [-]

I'm assuming you misspelled Shift-JIS on purpose because you're sick and tired of dealing with it. If that was an accidental misspelling, it was inspired. :-)