sejje 4 days ago

Humans are used to ignoring things while LLMs are explicitly trained to pay attention to the entire text.

Humans who haven't been exposed to trick problems or careful wording probably have a hard time too; they'll be less confident about ignoring things.

But the LLM should have seen plenty of trick problems as well.

To a human, the irrelevant text just doesn't parse as part of the problem. Humans have more options and room to think; the LLM has to respond.

I'd also like to see how responses were grouped: does the model ever refuse, and how do refusals get classified? Were they only counting math failures as wrong answers? There's room for subjectivity there.
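To make the point concrete, here's a toy grader (purely hypothetical categories and heuristics, nothing from the paper) showing how much the bucketing choices matter:

  import re

  def classify_response(response: str, expected_answer: str) -> str:
      """Toy grader: the paper's actual scheme isn't described, so these
      categories and heuristics are made up for illustration."""
      text = response.strip().lower()
      if any(p in text for p in ("i can't", "i cannot", "i won't", "as an ai")):
          return "refusal"       # does this count as a wrong answer or get excluded?
      numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
      if not numbers:
          return "no_answer"     # rambling about cats with no number at all
      if numbers[-1] == expected_answer:
          return "correct"
      return "wrong"             # only this bucket is unambiguously a math failure

  print(classify_response("The answer is 42.", "42"))                 # correct
  print(classify_response("I cannot answer trick questions.", "42"))  # refusal

Depending on whether "refusal" and "no_answer" land in the "wrong" column, the headline accuracy numbers could look quite different.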

Y_Y 4 days ago | parent

> LLMs are explicitly trained to pay attention to the entire text

I'd respectfully disagree on this point. The magic of attention in transformers is that it is selective: ideally it gives significant weight only to the tokens relevant to the query.
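A minimal sketch of what I mean, in plain NumPy (the toy vectors are made up for illustration): the softmax weights come from query-key similarity, so tokens that aren't relevant to the query can end up with near-zero weight.

  import numpy as np

  def softmax(x):
      x = x - x.max(axis=-1, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=-1, keepdims=True)

  def scaled_dot_product_attention(Q, K, V):
      # Standard scaled dot-product attention: each query's weights over
      # the keys sum to 1, and are driven by query-key similarity.
      d_k = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)   # (n_queries, n_keys)
      weights = softmax(scores)
      return weights @ V, weights

  # One query, three keys: the first key points the same way as the query,
  # the other two are orthogonal "distractor" tokens.
  Q = np.array([[1.0, 0.0]])
  K = np.array([[4.0, 0.0],    # relevant token
                [0.0, 4.0],    # irrelevant token
                [0.0, 4.0]])   # irrelevant token
  V = np.eye(3)
  out, w = scaled_dot_product_attention(Q, K, V)
  print(w.round(3))  # roughly [[0.894 0.053 0.053]]: weight concentrates on token 0

Of course, whether a trained model's attention heads actually down-weight the cat facts in practice is exactly what's in question.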

mcswell 4 days ago | parent

Ideally, yes. But probably because of our world knowledge, we humans know that cat facts don't affect mathematical facts (unless of course the cat is walking across the keyboard, in which case all bets are off). LLMs don't know that, and perhaps they're trying to figure out some connection by scanning their database for mathematical facts about cats. If they sleep most of the day, how many hours is that? Does that number factor (pardon the pun) into the math problem? What about six-toed cats (which do, btw, exist)? Spherical cows come up in math and physics; are there triangular cats (since the problem is about triangles)?

cubefox 4 days ago | parent

This raises the question of whether LLMs with an SSM architecture (e.g. Mamba) would perform differently from the Transformer models they tested, since SSMs do not use attention layers.

The model architecture is already known to affect some tasks. In particular, SSMs are worse than transformers at retrieving specific information from the context window [1], which e.g. reduces their performance on multiple-choice benchmarks, a difference that isn't reflected in their language modeling ability (perplexity).

1: https://x.com/avivbick/status/1917616943219236881