numpad0 3 days ago

I'm kind of wondering when it will become universal understanding that LLMs can't be trained on equal amounts of Japanese and Chinese content, because Han Unification turns the two languages into an incoherent mix of two conflicting syntaxes sharing one set of codepoints. It's remarkable that Latin-script languages don't appear to face the same issue, with no clear technical explanation as to why; my guess is it has to do with the granularity of the characters.
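Concretely, the "same" character in a Japanese sentence and a Chinese sentence is literally the same Unicode codepoint, so nothing at the character level marks which language you are in. A quick Python check (the two example sentences are just my own illustration):

    import unicodedata

    # "bone" is typeset differently in Japanese and Chinese fonts,
    # but Han Unification gives it a single codepoint, U+9AA8.
    ja = "骨が折れた"    # Japanese: "broke a bone"
    zh = "骨头断了"      # Chinese:  "the bone broke"

    print(hex(ord(ja[0])), hex(ord(zh[0])))  # 0x9aa8 0x9aa8
    print(ja[0] == zh[0])                    # True
    print(unicodedata.name(ja[0]))           # CJK UNIFIED IDEOGRAPH-9AA8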

That said, in my limited experience, LLMs all think in the majority language of their dataset. They don't adhere to the prompt language, one way or another. Chinese models usually think in either English or Chinese, rarely in a cursed mix of the two, and never in Japanese or any other language that isn't native to them.

ehnto 3 days ago | parent | next [-]

Would they not quickly become divergent vectors? In the same way that "apple" and "Apple" can exist in the same vector space with totally different meanings?

So all the information gleaned from reading a glyph in the context of Japanese articles would end up in totally different vectors from the information gleaned from the same glyph in Chinese?
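One way to poke at that is to compare the contextual vectors a multilingual encoder assigns to the same glyph in a Japanese versus a Chinese sentence. A minimal sketch, using Hugging Face's bert-base-multilingual-cased purely as a convenient stand-in (the sentences and the helper are illustrative, not what any particular LLM does internally):

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "bert-base-multilingual-cased"  # small multilingual encoder, stand-in only
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    def vector_for(sentence: str, char: str) -> torch.Tensor:
        """Contextual embedding of `char` inside `sentence`."""
        enc = tok(sentence, return_tensors="pt")
        tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        idx = next(i for i, t in enumerate(tokens) if char in t)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        return hidden[idx]

    # 手紙 means "letter" in Japanese, while 手纸 is "toilet paper" in
    # Chinese; the leading glyph 手 is the same codepoint in both.
    v_ja = vector_for("手紙を書きました。", "手")
    v_zh = vector_for("我去买手纸。", "手")
    print(torch.cosine_similarity(v_ja, v_zh, dim=0).item())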

numpad0 3 days ago | parent [-]

I don't know, but at least the older Qwen models were a bit confused about which words belong to which language, and recent ones seem noticeably less sure about ja-JP in general. Maybe it vaguely relates to Hanzi/Kanji characters being more coarse-grained than the Latin alphabet, so there aren't enough characters per text to tell the languages apart, or something.
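A back-of-the-envelope look at that guess (the parallel example sentences are my own): the same sentence takes far fewer symbols in Japanese or Chinese than in English, and some of those symbols are shared codepoints between the two, so each sentence carries less character-level evidence of which language it is:

    en = "I wrote a letter to my friend yesterday."
    ja = "昨日友達に手紙を書いた。"    # same meaning, Japanese
    zh = "我昨天给朋友写了一封信。"    # same meaning, Chinese

    for label, s in (("en", en), ("ja", ja), ("zh", zh)):
        print(label, "length:", len(s), "unique chars:", len(set(s)))

    # Codepoints the Japanese and Chinese sentences have in common.
    print("shared:", set(ja) & set(zh))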

ACCount37 3 days ago | parent | prev [-]

Why would that be an issue?