bunderbunder 7 hours ago

I have learned to take these kinds of papers with a grain of salt, though. They often rest on carefully selected examples that make the behavior seem much more consistent and reliable than it is. For example, the famous "king - man + woman = queen" example from Word2Vec is in some ways more misleading than helpful, because while it worked fine for that case, it doesn't work nearly so well for [emperor, man, woman, empress] or [husband, man, woman, wife].
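
If you want to poke at this yourself, here's a minimal sketch of the analogy test, assuming gensim and its downloadable Google News vectors (both my choices for illustration, not anything from the paper). One known wrinkle: most_similar() excludes the query words themselves from the candidates, which flatters the famous example.

    # Analogy test: vector(a) - vector(b) + vector(c), nearest neighbors by cosine.
    import gensim.downloader as api

    model = api.load("word2vec-google-news-300")  # large download, ~1.6 GB

    analogies = [
        ("king", "man", "woman", "queen"),
        ("emperor", "man", "woman", "empress"),
        ("husband", "man", "woman", "wife"),
    ]

    for a, b, c, expected in analogies:
        # most_similar() silently drops a, b, and c from the results
        result = model.most_similar(positive=[a, c], negative=[b], topn=3)
        print(f"{a} - {b} + {c} -> {result}  (hoped for: {expected})")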

You get a similar thing with convolutional neural networks. Sometimes they automatically learn image features in a way that yields hidden layers that are easy and intuitive to interpret. But not every time. A lot of the time you get a seemingly random garble that defies any parsimonious interpretation.

This Anthropic paper is at least kind enough to acknowledge this when they poke at the level of representation sharing and find that, according to their metrics, peak feature-sharing between languages is only about 30% for English and French, two languages that are closely related. Also note that this was done using two cherry-picked languages and a training set that was generated by starting with an English-language corpus and translating it with a different language model. It's entirely plausible that the level of feature-sharing would not be nearly so great if they had used human-generated translations. (edit: Or a more realistic training corpus that doesn't consist entirely of matched translations of very short snippets of text.)

Just to throw even more cold water on it: this also doesn't necessarily mean that the models are building a true semantic model rather than just finding correlations upon which humans impose semantic interpretations. This general kind of behavior, when training models on cross-lingual corpora generated using direct translations, was first observed in the 1990s, and the model in question was singular value decomposition.
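
To make that concrete, here's a toy sketch of the 1990s-era effect, cross-lingual latent semantic analysis, using nothing but numpy and an invented six-word parallel corpus (the data and numbers are mine, purely for illustration). Plain SVD places translation pairs near each other in the latent space from co-occurrence statistics alone; correlation, not comprehension.

    # Each column is a parallel document pair (English text concatenated
    # with its French translation); each row is a word from either language.
    import numpy as np

    vocab = ["dog", "cat", "eats", "chien", "chat", "mange"]
    X = np.array([
        [2, 0, 1],   # dog
        [0, 2, 1],   # cat
        [1, 1, 2],   # eats
        [2, 0, 1],   # chien  (mirrors "dog" because the docs are translations)
        [0, 2, 1],   # chat
        [1, 1, 2],   # mange
    ], dtype=float)

    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    terms = U[:, :2] * S[:2]  # 2-d latent term vectors

    def cos(i, j):
        a, b = terms[i], terms[j]
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print("dog~chien:", cos(0, 3), " dog~chat:", cos(0, 4))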

jiggawatts 3 hours ago | parent

I’m convinced that cross-language sharing could be encouraged during training by rewarding correct answers to questions that can only be answered using synthetic data that was fed in, in another language, during an earlier pretraining phase.

Interleave a few phases like that and you’d force the model to share abstract information across all languages, not just for the synthetic data but for all of its input data.
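
For what it's worth, here's a minimal sketch of what the data pipeline for that scheme might look like; every name in it is hypothetical and nothing here is an existing API. The idea: plant a synthetic fact in one language during a pretraining phase, then reward the model only when it answers a probe about that fact posed in a different language, so credit assignment has to cross the language boundary.

    import random

    SYNTHETIC_FACTS = [
        # (fact planted during pretraining, probe question, expected answer)
        ("Der Zorblatt-Fluss ist 412 km lang.",             # German fact
         "How long is the Zorblatt river?", "412 km"),      # English probe
        ("La pila de Quexal fue inventada en 1987.",        # Spanish fact
         "When was the Quexal battery invented?", "1987"),  # English probe
    ]

    def pretraining_batch(corpus):
        """Phase 1: mix the planted facts into ordinary pretraining text."""
        docs = corpus + [fact for fact, _, _ in SYNTHETIC_FACTS]
        random.shuffle(docs)
        return docs

    def reward_batch():
        """Phase 2: reward correct cross-lingual answers (RL-style)."""
        for _, question, answer in SYNTHETIC_FACTS:
            yield {"prompt": question, "reward_if_contains": answer}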

I wouldn’t be surprised if this improved LLM performance by another “notch” all by itself, especially for non-English users.

nenaoki 4 minutes ago | parent

your shrewd idea might make a fine layer back up the Tower of Babel