psidium 3 days ago
I don’t have the data, but I assume the corpus available to train an LLM is predominantly in English, written by Americans and their Western counterparts. If we’re training LLMs to sound like the training data, I imagine the responses have to match that worldview. My anecdote: before LLMs, I would default to searching Google in English instead of my native language, simply because there was so much more content in English that could help me. And here I am, producing novel sentences in English to respond to your message, further continuing the cycle where English is the main language in which to search and get things done.
tropdrop 3 days ago
In my experience, ChatGPT at least seems to have been trained on a corpus containing multiple languages. I'm guessing this from its interactions with me in a different language, where it rendered English idioms like "short and sweet" as analogous expressions in that language rather than direct translations. But my guess is that the data sets from other languages are smaller — and even with perfect access to every piece of data on the internet, that would still be true, given the astonishing quantity of English-language data compared to everything else. Your comment validates that. With less data, one would expect poorer performance on every metric for any non-Anglophone place, including the "cultural worldview" metric.
klooney 3 days ago
And the RLHF was directed by Californians, so the "values" are likely very Californian.
DaveZale 3 days ago
english is the lingua franca ;-)