Y_Y 4 days ago

> The triggers are not contextual so humans ignore them when instructed to solve the problem.

Do they? I've found humans to be quite poor at ignoring irrelevant information, even when it isn't about cats. I would have insisted on a human control group to compare the results with.

jmilloy 4 days ago | parent | next [-]

Did you look at the examples? There's a big difference between "if I have four apples and two cats, and I give away 1 apple, how many apples do I have", which is the kind of irrelevant information that at least appears applicable, and "if I have four apples and give away one apple, how many apples do I have? Also, did you know cats use their tails to help balance?", which really wouldn't confuse most humans.

krisoft 4 days ago | parent | next [-]

> which really wouldn't confuse most humans

And I think it would. I think a lot of people would ask the invigilator whether something is wrong with the test, or answer both questions, or write a short answer to the cat question too, or get confused and give up.

That is the kind of question where, if it were put on a test, I would expect kids to start squirming and looking at each other and the teacher right as they reach it.

I’m not sure how big this effect is, but it would be very surprising if there were no effect and unsuspecting, unwarned people performed the same on the “normal” and the “distractions” tests. Especially if the extra information is phrased as a question, like in your example.

I’ve heard from teachers that students get distracted when irrelevant details are added to word problems. This is obviously anecdotal, but the teachers I chatted with about this thought it is because people are trained throughout their whole education that all elements of a word problem must be used. So when extra bits are added, people’s minds desperately try to use them.

But the point is not that I’m right. Maybe I’m totally wrong. The point is that if the paper wants to state it as a fact one way or the other, it should have performed an experiment, or cited prior research, or avoided stating an unsubstantiated opinion about human behaviour and stuck to describing the AI.

diamond559 4 days ago | parent | next [-]

Yeah you're right, if that human is 5 years old or has crippling ADHD.

atq2119 4 days ago | parent | next [-]

Not at all. There are cultural expectations within each field of what kind of questions students expect to be on a test. If those expectations are violated by the test, students will reasonably be distracted, second-guess themselves, etc.

krisoft 4 days ago | parent | prev | next [-]

You can argue until the cows come home. The point is that they claim, without evidence, that humans are not susceptible to this kind of distraction.

If they want to establish this as a fact, there is a trivially easy experiment they can conduct.

“Someone on Hacker News strongly feels it is true, and is willing to argue the case with witty comments” is not how scientific knowledge is established. We either have done the experiments and have the data, or we don’t.

imtringued 3 days ago | parent [-]

The answer is three apples.

ACCount36 4 days ago | parent | prev [-]

You think too highly of humans.

Humans are not reliable. For every "no human would make this kind of mistake", you can find dozens to hundreds of thousands of instances of humans making this kind of mistake.

const_cast 4 days ago | parent | next [-]

That's just because there's a lot of humans and we're doing a lot of things, all the time.

Humans are pretty good at not making mistakes in high-reasoning scenarios. The problem is that humans make mistakes in everything pretty constantly. Like, even saying a word - people say the wrong word all the time.

So when we look at really easy tasks that can be trivially automated, like say adding 2 + 2, we say "humans are so stupid! Computer is smart!".

Because humans get 2 + 2 wrong 1% of the time, but computers always get it right.

But, as we know, this isn't how it works. Actually, humans are much smarter than computers, and it's not even close. Because intelligence is multi-dimensional. The thing is, that failure rate for humans stays pretty constant as the complexity of the task increases, to a degree. Whereas computers start failing more and more, and quickly. It's a very, VERY sharp cliff for algorithms.

LLMs take the cliff further, but they do not eliminate it.

margalabargala 4 days ago | parent | prev [-]

A reasonable person [0] would not make that mistake.

[0] https://en.m.wikipedia.org/wiki/Reasonable_person

ACCount36 4 days ago | parent [-]

[flagged]

dolebirchwood 4 days ago | parent | next [-]

If nothing else, you're certainly making your case stronger with each successive comment.

margalabargala 4 days ago | parent | prev [-]

No but I've read about them in books.

bugbuddy 4 days ago | parent | prev | next [-]

An LLM’s source of “knowledge” is almost purely statistical. The prompt injections create statistical noise that makes the token search a crapshoot. My guess is there are certain words and phrases that generate and amplify the statistical noise.

throwanem 4 days ago | parent | prev [-]

I wonder if there's variation at play here in testing culture, whether spatially or temporally or both.

CJefferson 4 days ago | parent | prev | next [-]

As someone who has written and graded a lot of university exams, I'm sure a decent number of students would write the wrong answer to that. A bunch of students would write 5 (adding all the numbers). Others would write "3 apples and 2 cats", which is technically not what I'm looking for (though personally I would give it full marks; some wouldn't).

Many students clearly try to answer exams by pattern matching, and I've seen a lot of exams where students "matched" on a pattern based on one word in a question and did something totally wrong.

jonathanlydall 4 days ago | parent | next [-]

Many professionals with lower skilled jobs sometimes lean too heavily on pattern matching too.

For example, customer service reps often vaguely match your request with a possibly or only vaguely applicable templated response.

Technically savvy customers who try to explain problems in detail are probably more likely to get an outright non-applicable canned response, as the CS rep gets frustrated with the amount of information and latches onto the first phrase that relates to a templated response without really considering context.

My reply’s getting a little tangential now, but I feel this is good life advice: I’ve found I’m more likely to get decent customer service if I keep my requests as short as possible.

The first sentence needs to state the issue I need help with. In some cases a bulleted list of things I’ve tried helps, and I make sure to include essential info like an account number, e.g.:

I’m getting error 13508 when I try to log into my account. I’ve already tried the following solutions with no success:

- Clearing my browser cache and cookies.

- Restarting my computer.

- Running all software updates.

My account number: xxx

What is the next step here?

marcus_holmes 4 days ago | parent [-]

> What is the next step here?

The next step will be to walk you through clearing your browser cache and cookies.

Because the CS rep has no idea who you are, and your protestations of competency fall on deaf ears because they've dealt with 23325424 people in the last year that claimed to know what they're doing but actually didn't at all.

Their goal is to get through the script, because getting through the script is the only way to be sure that it's all been done the way it needs to be done. And if they don't run through the script, and refer you to the next level of support, and it turns out that you hadn't actually cleared your browser cache and cookies, then that's their fault and they get dinged for it.

I always approach these situations with this understanding; that the quickest way to get my problem solved is to help them work through their script. And every now and then, just occasionally, working through the script has shown up something simple and obvious that I'd totally missed despite my decades of experience.

fc417fc802 3 days ago | parent [-]

The robots are even worse than the humans. Recently I called an ISP and got one that insisted I restart all the equipment, wait 10 minutes, and call back. Never mind that the issue was entirely unrelated to the equipment. It had asked for a description of the problem but apparently couldn't actually do anything with that information. After I refused enough times, it simply hung up on me.

Obviously I don't do business with that company anymore.

jaccola 4 days ago | parent | prev | next [-]

The parent's whole point is contrary to this (they agree with you): the context didn't even include numbers to pattern match on!

CJefferson 4 days ago | parent [-]

Sorry, I failed at pattern matching myself :)

However, I still think any irrelevant facts would upset a number of exam takers, and claiming it "clearly" wouldn't is far too strong a claim to make without evidence.

kazinator 4 days ago | parent | prev | next [-]

When you try to wing your way through a question by pattern matching, you are not applying intelligence. Your interests lie elsewhere, so you are just fumbling your way through the activity at hand to get it over with.

crabmusket 4 days ago | parent [-]

This is something that the rise of LLMs has highlighted for me. Sometimes, we don't care to apply our intelligence to a problem. I've come to think of myself as "acting like an LLM" when I do this.

It reminds me of Kahneman's "system 1" (fast) and "system 2" (slow) thinking. LLMs are system 1 - fast, intuitive, instinctual. Humans often think that way. But we can also break out system 2 when we choose to, and apply logic, reason, etc.

kazinator 4 days ago | parent [-]

Other "LLM Like" behaviors: telling corny jokes based on puns, using thought-terminating cliches, freely associating irrelevant cultural references in serious discussion ...

viccis 4 days ago | parent | prev [-]

I agree that poor test takers are easily distracted, and this is the reason that "word problems" are heavily emphasized in preparation for tests like the SAT or state proficiency exams.

But in general I do not think these models are claiming at being good at replicating the performance of a distracted or otherwise low performing pupil. I think they should be evaluated against humans who are capable of completing word problems containing context that is not inherently necessary to the math question. The reason those tests I mentioned use these word problems is that it's a way to evaluate someone's ability to think in abstract mathematical terms about everyday situations, which obviously involve lots of unimportant information the person must choose to consider or not.

tl;dr: I think a reasonably competent high school student could answer the apple and cat question, which is absolutely a reasonable bar for an LLM to clear. If university students are failing these questions, then they have not been taught test taking skills, which should be considered a mathematical failure just as unacceptable as that of the LLM, not a mitigating similarity for the latter.

wagwang 4 days ago | parent | prev | next [-]

Yes, especially interview questions that include a stupid "real life example" that is usually irrelevant to the question.

wongarsu 4 days ago | parent | prev | next [-]

If asked verbally, that would absolutely confuse some humans. Easily enough to triple the error rate for that specific question (granted, that's easier than the actual questions, but still). Even in a written test with time pressure it would probably still have a statistically significant effect.

kazinator 4 days ago | parent | next [-]

The problem with your reasoning is that some humans cannot solve the problem even without the irrelevant info about cats.

We can easily cherry pick our humans to fit any hypothesis about humans, because there are dumb humans.

The issue is that AI models which, on the surface, appear to be similar to the smarter quantile of humans in solving certain problems, become confused in ways that humans in that problem-solving class would not be.

That's obviously because the language model is not generally intelligent; it's just retrieving tokens from a high-dimensional statistically fit function. The extra info injects noise into the calculation, which confounds it.

krisoft 4 days ago | parent | next [-]

> We can easily cherry pick our humans to fit any hypothesis about humans, because there are dumb humans.

Nah. You would take a large number of humans, have half of them take the test with distractions and half without distracting statements, and then compare their results statistically. Yes, there would be some dumb ones, but as long as you test enough people they would show up in both samples at roughly the same rate.
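For concreteness, a minimal sketch of what such a comparison could look like (the group sizes and success counts below are invented purely for illustration; it assumes Python with scipy and uses a plain two-proportion z-test, not anything from the paper):

    # Hypothetical data: 200 test takers per group, counting correct answers.
    # One-sided two-proportion z-test: does the distractor group do worse?
    from math import sqrt
    from scipy.stats import norm

    control_correct, control_n = 150, 200    # plain word problems
    distract_correct, distract_n = 120, 200  # same problems plus cat facts

    p1 = control_correct / control_n
    p2 = distract_correct / distract_n
    p_pool = (control_correct + distract_correct) / (control_n + distract_n)

    se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / distract_n))
    z = (p1 - p2) / se
    p_value = 1 - norm.cdf(z)  # small p: the distraction really does hurt accuracy

    print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")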

> become confused in ways that humans in that problem-solving class would not be.

You just state the same thing others are disputing. Do you think it will suddenly become convincing if you write it down a few more times?

Kuinox 4 days ago | parent | prev [-]

That's obviously because the brain is not generally intelligent; it's just retrieving concepts from a high-dimensional statistically fit function. The extra info injects noise into the calculation, which confounds it.

kazinator 4 days ago | parent | next [-]

The problem with your low-effort retort is that, for example, the brain can wield language without having to scan anywhere near hundreds of terabytes of text. People acquire language from vastly fewer examples, and are able to infer/postulate rules, and articulate the rules.

We don't know how.

While there may be activity going on in the brain interpretable as high-dimensional functions mapping inputs to outputs, you are not doing everything with just one fixed function evaluating static weights from a feed-forward network.

If it is like neural nets, it might be something like numerous models of different types, dynamically evolving and interacting.

Kuinox 3 days ago | parent [-]

The problem with your answer is that you make assertions resting on logical fallacies. Neither of us knows how LLMs or brains work to produce their output. Any assertion about that without proof is a claim without any basis.

For example, in this response: > the brain can wield language without having to scan anywhere near hundreds of terabytes of text.

The amount of text we need to train an LLM only goes down; even two years ago it was shown that you need fewer than a few million words in order to "acquire" English: https://tallinzen.net/media/papers/mueller_linzen_2023_acl.p...

kazinator 3 days ago | parent [-]

Training the weights of the neural network produces a humungous function with a vast number of parameters.

Such a function is not inherently mysterious due to the size alone. For instance, if we fit a billion numeric points to a polynomial curve having a billion coefficients, we would not be mystified as to how the polynomial interpolates between the points.
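To make the analogy concrete, a toy sketch (a handful of points rather than a billion; assumes numpy): a degree-(n-1) fit passes exactly through all n points, yet the coefficients by themselves say little about why the curve behaves as it does between them.

    import numpy as np

    # Five points, degree-4 polynomial: the fit reproduces the "training data"
    # exactly, but the coefficients don't obviously explain the in-between shape.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

    coeffs = np.polyfit(x, y, deg=len(x) - 1)
    poly = np.poly1d(coeffs)

    print(np.round(poly(x), 6))  # matches y up to floating-point error
    print(poly(2.5))             # interpolated value between the known points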

Be that as it may, the trained neural network function does have mysterious properties, that is true.

But that doesn't mean we don't know how it works. We invented it and produced it by training.

To say that we completely don't understand it is like saying we don't understand thermodynamics because the laws of thermodynamics don't let us predict the path of a single gas particle, and so we must remain mystified as to how the gas can take on the shape of the container.

Say we train a neural network to recognize digit characters. Of course we know why it produces the answer 3 when given any one of our training images of 3: we iterated on bumping the weights until it did that. When we give it an image of 3 not in our training set and it produces some answer (either correctly 3, or something disappointing), we are less sure. We don't know the exact properties of the multi-dimensional function which encode the "threeness" of the image.

Sure; so what? It's a heck of a lot more than we know about how a person recognizes a 3, where we had no design input, and don't even know the complete details of the architecture. We don't have a complete model of just one neuron, whereas we do have a complete model of a floating-point number.

Gas in a container is a kind of brain which figures out how to mimic the shape of the container using a function of a vast number of parameters governing the motion of particles. Should we be mystified and declare that we don't understand the thermodynamic laws we came up with because they don't track the path taken by a particle of gas, and don't explain how every particle "knows" where it is supposed to be so that the gas takes on the shape of the cylinder, and has equal pressure everywhere?

Kuinox 2 days ago | parent [-]

> we would not be mystified as to how the polynomial interpolates between the points.

We would not be surprised, but we wouldn't know how the model solves the problem. We wouldn't know whether it is approximating, calculating the correct value, or memorising results. We would only know how it was built. We would be mystified as to how it solved the problem.

> But that doesn't mean we don't know how it works. We invented it and produced it by training.

Having invented something does not mean we know how it works. The fallacy in your reasoning is thinking that emergent behaviour or properties can be trivially explained by knowing the building blocks.

const_cast 4 days ago | parent | prev [-]

Yes, how... obvious?

I don't know, do we even know how the brain works? Like, definitively? Because I'm pretty sure we don't.

Kuinox 3 days ago | parent [-]

Yeah, we don't; that's one of the points of my reply. We don't know how LLMs work either.

3 days ago | parent [-]
[deleted]
cantor_S_drug 4 days ago | parent | prev | next [-]

Is the model thinking "what is the cat doing here?" and then starting to think it is being tested?

lawlessone 4 days ago | parent | next [-]

Even if the model "ignores" it, won't the presence of the irrelevant text alter the probability of its output in some way?

wongarsu 4 days ago | parent | prev | next [-]

I have no clue what the model is thinking, and as far as I can tell the paper also makes no attempt at answering that. It's also not really the point; the point is that the claim in the paper that humans would be unaffected is unsubstantiated and highly suspect. I'd even say more likely wrong than right.

xienze 4 days ago | parent | next [-]

> It's also not really the point, the point is more that the claim in the paper that humans would be unaffected is unsubstantiated and highly suspect.

I think the question that adds a random cat factoid at the end is going to trip up a lot fewer humans than you think. At the very least, they could attempt to tell you after the fact why they thought it was relevant.

And ignoring that, obviously we should be holding these LLMs to a higher standard than “human with extraordinary intelligence and encyclopedic knowledge that can get tripped up by a few irrelevant words in a prompt.” Like, that should _never_ happen if these tools are what they’re claimed to be.

lawlessone 4 days ago | parent [-]

I'm sure humans would be affected in some way, but not at all in the same way an LLM would be.

A human would probably note it as a trick in their reply.

The way LLMs work, it could bias their replies in weird and unexpected ways, beyond just treating it as a trick.

cantor_S_drug 4 days ago | parent | prev [-]

They should prompt the model to ignore irrelevant information and test whether it then performs better, i.e. whether it is good at ignoring those statements.

Detrytus 3 days ago | parent | prev [-]

I wonder if the problem here is simply hitting some internal quota on compute resources. Like, if you send the model on a wild goose chase with irrelevant information, it wastes enough compute time on it that it fails to arrive at the correct answer to the main question.

cantor_S_drug 3 days ago | parent [-]

Possibly. But it could also indicate that the initial tokens set the direction, or the path the model goes down. Just as when a person mentions two distinct topics close together in conversation, the listener decides which topic to continue with.

lawlessone 4 days ago | parent | prev [-]

A human would immediately identify it as a trick.

metalman 4 days ago | parent | prev | next [-]

"wouldn't confuse most humans", yes but no first presumption is that we are talking about humans doing math, in some sort of internet setting. second presumption is that this human has been effected by the significant percentage of the internet devoted to cats and that there response is going to be likely frustration and outrage at cats invading math, or massive relief in having cat meems worked into something otherwise tedious and then the third presumption is that a large number of "humans" wont be aware of the cats in math thing, because they imediatly offloaded the task to an LLM

graeme 4 days ago | parent | prev | next [-]

It absolutely would if you start hitting working memory constraints. And at the margins some people who would be 50:50 on a given math problem will have working memory constraints.

lupusreal 4 days ago | parent | prev [-]

Any kind of distraction is likely to impact human test scores, unless the test is well below their level or they're otherwise very comfortable with the subject matter. Math specifically makes most of the general public feel a bit in over their head, so tossing random cat facts into the mix is going to get people more confused and nervous.

Maybe I'm totally wrong about that, but they really should have tested humans too; without that comparison this result seems lacking.

pinkmuffinere 4 days ago | parent | prev | next [-]

Ya, I specifically remember solving word problems in school / college and getting distracted by irrelevant details. Usually I would get distracted by stuff that _seemed_ like it should be used, so maybe cat facts would be fine for me to tease out, but in general I don't think I'm good at ignoring extraneous information.

Edit: To be fair, in the example provided, the cat fact is _exceptionally_ extraneous, and even flagged with 'Fun Fact:' as if to indicate it's unrelated. I wonder if they were all like that.

dylan604 4 days ago | parent | next [-]

I had always assumed that the extraneous information was part of the test. You have to know/understand the concept well enough to know that the information was extraneous.

kayodelycaon 4 days ago | parent [-]

From what I remember of school, extraneous information was rarely included, and the teachers who did add it seemed to do so maliciously.

There was one math class at a private school I attended that was the exception. Its textbook made identifying relevant information part of several chapters.

brazzy 4 days ago | parent | prev [-]

It's a well-known problem for humans as well: https://en.wikipedia.org/wiki/Age_of_the_captain

sejje 4 days ago | parent | prev | next [-]

Humans are used to ignoring things while LLMs are explicitly trained to pay attention to the entire text.

Humans who haven't been exposed to trick problems or careful wording probably have a hard time, they'll be less confident about ignoring things.

But the LLM should have seen plenty of trick problems as well.

It just doesn't parse as part of the problem. Humans have more options, and room to think. The LLM had to respond.

I'd also like to see how responses were grouped: does it ever refuse, how do refusals get classed, etc. Were they only counting math failures as wrong answers? There's room for subjectivity.

Y_Y 4 days ago | parent [-]

> LLMs are explicitly trained to pay attention to the entire text

I'd respectfully disagree on this point. The magic of attention in transformers is the selective attention applied, which ideally only gives significant weight to the tokens relevant to the query.
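As a rough illustration of that selective weighting, here is a toy scaled dot-product attention in plain numpy (a generic sketch, not any particular model's implementation; the token setup is invented):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        # Scaled dot-product attention: every token gets some weight,
        # but the softmax lets the query concentrate on the relevant ones.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = softmax(scores)
        return weights @ V, weights

    rng = np.random.default_rng(0)
    d = 8
    keys = rng.normal(size=(5, d))                     # pretend: 4 math tokens plus 1 cat fact
    query = keys[2:3] + 0.1 * rng.normal(size=(1, d))  # a query resembling token 2

    out, w = attention(query, keys, keys)
    print(np.round(w, 2))  # ideally most of the weight lands on token 2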

mcswell 4 days ago | parent | next [-]

Ideally, yes. But probably because of our world knowledge, we humans know that cat facts don't affect mathematical facts (unless of course the cat is walking across the keyboard, in which case all bets are off). LLMs don't know that, and perhaps they're trying to figure out some connection by scanning their database for mathematical facts about cats. If cats sleep most of the day, how many hours is that? Does that number factor (pardon the pun) into the math problem? What about six-toed cats (which do, by the way, exist)? Spherical cows come up in math and physics; are there triangular cats (since the problem is about triangles)?

cubefox 4 days ago | parent | prev [-]

This raises the question of whether the performance of LLMs with an SSM architecture (Mamba) would differ from the Transformer models they tested, because SSMs do not use attention layers.

The model architecture is actually already known to have effects on some tasks. In particular, SSMs are worse than transformers at retrieving specific information from the context window [1], which e.g. reduces their performance on multiple choice benchmarks. Which is a performance difference that isn't reflected in their language modeling ability (perplexity).

1: https://x.com/avivbick/status/1917616943219236881

kazinator 4 days ago | parent | prev | next [-]

I doubt that the performance of those human subjects who can solve those problems when no distractors are included will be worsened by 300% when the distractors are included.

0awu35oua32 4 days ago | parent | prev | next [-]

Ooooh yeah. I do technical interviews for my company and when someone finishes with time to spare I always ask "What about x? How does that affect our solution?" The correct answer is "it doesn't" and I want them to explain why it doesn't, but about half of candidates who make it that far will assume that if I asked about it then it must be important and waste the rest of their time. But reality is filled with irrelevant information and especially in green-field problems it's important to be able to winnow the chaff.

layer8 4 days ago | parent | prev | next [-]

It would have been interesting to see how a human control group performs, but it also seems highly unlikely that it would triple their error rate.

slashdave 4 days ago | parent | prev | next [-]

Not sure how useful a comparison to humans would be, and to expect a degradation of 300% seems to stretch things a bit. After all, cats can jump up to five times their height.

Terretta 2 days ago | parent | prev | next [-]

If you spell “sit in the tub” s-o-a-k soak, and you spell “a funny story” j-o-k-e joke, how do you spell “the white of an egg”?

Context engineering* has been around longer than we think. It works on humans too.

The cats are just adversarial context priming, same as this riddle.

* I've called it "context priming" for a couple of years, for reasons shown by this children's riddle, while thinking of "context engineering" as iteratively determining what priming unspools robust, resilient results for the question.

protocolture 4 days ago | parent | prev | next [-]

Guilty. I remember taking an aptitude test in primary school, and choosing an answer based on my familiarity with the subject in the math test (IIRC the question mentioned the space shuttle) instead of actually attempting to solve the problem. I got cleanly filtered on that test.

mvdtnz 4 days ago | parent | prev | next [-]

Did you read a single one of the examples? No human would be influenced by these.

viccis 4 days ago | parent [-]

It's ridiculous. People in here are acting like adding some trivia about a cat would destroy most people's ability to answer questions. I don't know if it's contrarianism, AI defensiveness, or an egotistical need to correct others with a gotcha, but people just LOVE to rush to invent ridiculous situations and act like they break a very reasonable generalization.

rsynnott 3 days ago | parent [-]

A lot of this website is _ultra_ offended by any suggestion that LLMs are not all that.

Xss3 4 days ago | parent | prev [-]

Read the article before commenting next time and you won't end up looking like a typical redditor.

cwillu 4 days ago | parent [-]

“Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that". ”

--https://news.ycombinator.com/newsguidelines.html

Xss3 3 days ago | parent [-]

Thanks, will stick to that in future