I somewhat agree, but I think that the language example is not a good one. As Anthropic have demonstrated[0], LLMs do have "conceptual neurons" that generalise an abstract concept which can later be translated to other languages.

The issue is that those concepts are encoded in intermediate layers during training, absorbing biases present in training data. It may produce a world model good enough to know that "green" and "verde" are different names for the same thing, but not robust enough to discard ordering bias or wording bias. Humans suffer from that too, albeit arguably less.

[0] https://transformer-circuits.pub/2025/attribution-graphs/bio...

▲

bunderbunder a month ago | parent | next [-]

I have learned to take these kinds of papers with a grain of salt, though. They often rest on carefully selected examples that make the behavior seem much more consistent and reliable than it is. For example, the famous "king - man + woman = queen" example from Word2Vec is in some ways more misleading than helpful, because while it worked fine for that case it doesn't necessarily work nearly so well for [emperor, man, woman, empress] or [husband, man, woman, wife].

You get a similar thing with convolutional neural networks. Sometimes they automatically learn image features in a way that yields hidden layers that easy and intuitive to interpret. But not every time. A lot of the time you get a seemingly random garble that belies any parsimonious interpretation.

This Anthropic paper is at least kind enough to acknowledge this fact when they poke at the level of representation sharing and find that, according to their metrics, peak feature-sharing among languages is only about 30% for English and French, two languages that are very closely aligned. Also note that this was done using two cherry-picked languages and a training set that was generated by starting with an English language corpus and then translating it using a different language model. It's entirely plausible that the level of feature-sharing would not be nearly so great if they had used human-generated translations. (edit: Or a more realistic training corpus that doesn't entirely consist of matched translations of very short snippets of text.)

Just to throw even more cold water on it, this also doesn't necessarily mean that the models are building a true semantic model and not just finding correlations upon which humans impose semantic interpretations. This general kind of behavior when training models on cross-lingual corpora generated using direct translations was first observed in the 1990s, and the model in question was singular value decomposition.

▲

jiggawatts a month ago | parent [-]

I’m convinced that language sharing can be encouraged during training by rewarding correct answers to questions that can only be answered based on synthetic data in another language fed in during a previous pretraining phase.

Interleave a few phases like that and you’d force the model to share abstract information across all languages, not just for the synthetic data but all input data.

I wouldn’t be surprised if this improved LLM performance by another “notch” all by itself, especially for non-English users.

	▲	nenaoki a month ago \| parent [-]
		your shrewd idea might make a fine layer back up the Tower of Babel

▲

nowittyusername a month ago | parent | prev [-]

I've read the paper before I made the statement. And I still made the statement because there are issues with their paper. The first problem is that the way in which anthropic trains their models and the architecture of their models is different from most of the open source models people use. they are still transformer based, but they are not structurally put together the same as most models, so you cant extrapolate their findings on their models to other models. Their training methods also use a lot more regularization of the data trying to weed out targeted biases as much as possible. meaning that the models are trained on more synthetic data which tries to normalize the data as much as possible between languages, tone, etc.. Same goes for their system prompt, their system prompt is treated differently versus open source models which append the system prompt in front of the users query internally. The attention ais applied differently among other things. Second the way that their models "internalize" the world is vastly different then what humans would thing of "building a world model" of reality. Its hard to put it in to words but basically their models do have a underlying representative structure but its not anything that would be of use in the domains humans care about, "true reasoning". Grokking the concept if you will. Honestly I highly suggest folks take a lot of what anthropic studies with a grain of salt. I feel that a lot of information they present is purposely misinterpreted by their teams for media or pr/clout or who knows what reasons. But the biggest reason is the one i stated at the beginning, most models are not of the same ilk as Anthropic models. I would suggest folks focus on reading interpretability research on open source models as those are most likely to be used by corporations for their cheap api costs. And those models have no where near the care and sophistication put in to them as anthropic models.

	▲	bunderbunder a month ago \| parent [-]
		> I feel that a lot of information they present is purposely misinterpreted by their teams for media or pr/clout or who knows what reasons. I think it's just the culture of machine learning research at this point. Academics are better about it, but still far from squeaky clean. It can't be squeaky clean, because if you aren't willing to make grand overinflated claims to help attract funding, someone else will be, and they'll get the research funding, so they'll be the ones who get to publish research. It's like an anthropic principle of AI research. (rimshot)