▲ | ehnto 3 days ago |
I was using Qwen3 locally in thinking mode, and noticed that even when it is talking to me in Japanese, it does its "thinking" steps in English. Not having a full understanding of how the layers in an LLM handle language connections, I can't say for sure, but for a human this would result in subpar outcomes. For example (not actual output):

Input: "こにちは" (konichiwa)

Qwen Thinking: "Ah, the user has said 'こにちは', I should respond in a kind and friendly manner."

Qwen Output: こにちは!

It quiiiickly gets confused in this mode, much quicker than in English.
▲ | numpad0 3 days ago | parent | next [-] |
I'm kind of wondering when it will become universally understood that LLMs can't be trained with equal amounts of Japanese and Chinese content in the training data, because Han Unification turns the two languages into an incoherent mix of two conflicting writing systems sharing one set of code points. It's remarkable that languages written in the Latin script apparently don't face this issue, with no clear technical explanation as to why; my guess is it has to do with the granularity of characters.

That said, in my tiny experience, LLMs all think in their dataset's majority language. They don't adhere to prompt languages, one way or another. Chinese models usually think in either English or Chinese, rarely in a cursed mix thereof, and never in Japanese or any other non-native language.
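The Han Unification point is easy to demonstrate: Unicode assigns a single code point to a Han character regardless of whether it appears in Japanese or Chinese text, so the character stream itself carries no language tag for a tokenizer to pick up on. A minimal illustration in Python:

```python
import unicodedata

# 直 is rendered with different glyph shapes in Japanese vs. Chinese
# fonts, but Unicode unifies it into one code point — so training data
# in both languages maps onto the same tokens for this character.
ch = "直"
print(f"U+{ord(ch):04X}")       # → U+76F4
print(unicodedata.name(ch))     # → CJK UNIFIED IDEOGRAPH-76F4
```

Latin-script languages, by contrast, mostly disambiguate at the word level (distinct spellings), which may be part of why mixing, say, English and French corpora causes less trouble.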
▲ | charlieyu1 3 days ago | parent | prev | next [-] |
I don’t think this can be solved until there is massive investment in training LLMs on native Japanese. The current ChatGPT tokenizer still uses BPE, and you can’t even represent a Japanese character with a single token.
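To make the tokenizer point concrete: byte-level BPE operates on UTF-8 bytes, and a single kana character occupies three bytes, so unless the tokenizer's merge table happens to cover that exact byte sequence, one character gets split across multiple tokens. A quick sketch in plain Python (no tokenizer library assumed):

```python
# A kana character is 3 bytes in UTF-8; an ASCII letter is 1 byte.
# A byte-level BPE with no merges for these bytes would need 3 tokens
# for the kana versus 1 for the ASCII letter.
print(len("こ".encode("utf-8")))       # → 3
print(len("a".encode("utf-8")))        # → 1
print(len("こんにちは".encode("utf-8")))  # → 15 bytes for 5 characters
```

In practice, merges learned from training data reduce the token count for common Japanese strings, but the byte-level starting point is far less favorable than for Latin text.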
▲ | lmm 3 days ago | parent | prev | next [-] |
Perhaps it knows most users who misspell こんにちは are English speakers?
▲ | ACCount37 3 days ago | parent | prev [-] |
Quite a few reasoning LLMs do their reasoning in English only, because the RL setup specifically forces them to. Why? Because the creators want the reasoning trace to be human-readable. Without a pressure forcing them to think in English, they tend to get weird with the reasoning trace: wild language-mixing, devolved grammar, strange language-mixed nonsense words that the LLM itself seemingly understands just fine.
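For illustration only: a language-consistency pressure in an RL reward could be as crude as scoring what fraction of a trace's letters are ASCII. The function below is a hypothetical sketch of that idea, not any lab's actual reward:

```python
# Hypothetical sketch: a crude "reasoning must be in English" signal
# that scores the fraction of alphabetic characters that are ASCII.
# A real setup would be more sophisticated (language ID models, etc.).
def english_ratio(trace: str) -> float:
    letters = [c for c in trace if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)

print(english_ratio("Ah, the user greeted me."))  # → 1.0
print(english_ratio("ユーザーが挨拶した"))              # → 0.0
```

Penalizing low scores during RL would push the model toward English-only traces, at the cost of whatever efficiency the "weird" mixed-language reasoning was buying it.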