krisoft 4 days ago

> which really wouldn't confuse most humans

And I think it would. I think a lot of people would ask the invigilator whether something is wrong with the test, or maybe answer both questions, or write a short answer to the cat question too, or get confused and give up.

That is the kind of question where, if it were put on a test, I would expect kids to start squirming and looking at each other and the teacher right as they reach it.

I’m not sure how big this effect is, but it would be very surprising if there were no effect at all and unsuspecting, unwarned people performed the same on the “normal” and the “distractions” tests, especially if the extra information is phrased as a question, as in your example.

I have heard from teachers that students get distracted if you add irrelevant details to word problems. This is obviously anecdotal, but the teachers I chatted with about this thought it is because people are trained throughout their whole education that every element of a word problem must be used. So when you add extra bits, people’s minds desperately try to use them.

But the point is not that I’m right. Maybe I’m totally wrong. The point is that if the paper wants to state this as a fact one way or the other, it should have performed an experiment, or cited prior research, or avoided stating an unsubstantiated opinion about human behaviour and stuck to describing the AI.

diamond559 4 days ago | parent | next [-]

Yeah you're right, if that human is 5 years old or has crippling ADHD.

atq2119 4 days ago | parent | next [-]

Not at all. There are cultural expectations within each field of what kind of questions students expect to be on a test. If those expectations are violated by the test, students will reasonably be distracted, second-guess themselves, etc.

krisoft 4 days ago | parent | prev | next [-]

You can argue until the cows come home. The point is that they claim without evidence that humans are not susceptible to this kind of distraction.

If they want to establish this as a fact, there is a trivially easy experiment they could conduct.

“Someone on Hacker News strongly feels it is true and is willing to argue the case with witty comments” is not how scientific knowledge is established. Either we have done the experiments and have the data, or we don’t.

imtringued 3 days ago | parent [-]

The answer is three apples.

ACCount36 4 days ago | parent | prev [-]

You think too highly of humans.

Humans are not reliable. For every "no human would make this kind of mistake", you can find dozens to hundreds of thousands of instances of humans making this kind of mistake.

const_cast 4 days ago | parent | next [-]

That's just because there are a lot of humans and we're doing a lot of things, all the time.

Humans are pretty good at not making mistakes in high-reasoning scenarios. The problem is that humans make mistakes in everything pretty constantly. Like, even saying a word - people say the wrong word all the time.

So when we look at really easy tasks that can be trivially automated, like say adding 2 + 2, we say "humans are so stupid! Computer is smart!".

Because humans get 2 + 2 wrong 1% of the time, but computers always get it right.

But, as we know, this isn't how it works. Actually, humans are much smarter than computers, and it's not even close. Because intelligence is multi-dimensional. The thing is, that failure rate for humans stays pretty constant as the complexity of the task increases, to a degree. Whereas computers start failing more and more, and quickly. It's a very, VERY sharp cliff for algorithms.

LLMs take the cliff further, but they do not eliminate it.
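
To make that cliff concrete, here's a toy sketch in Python (numbers made up, purely illustrative, not from any study): a roughly flat per-step error rate degrades gracefully as tasks get longer, while an error rate that grows with complexity collapses fast.

    # Toy model, my own numbers: overall success when per-step error is flat
    # vs. when it grows with task length.
    def success(per_step_error, steps):
        return (1 - per_step_error) ** steps

    for steps in (1, 5, 20, 50):
        flat = success(0.01, steps)                       # roughly constant error rate
        growing = success(min(0.01 * steps, 0.9), steps)  # error climbs with complexity
        print(f"{steps:>2} steps: flat {flat:.2f} vs growing {growing:.2g}")

With these made-up numbers the flat-error agent still succeeds about 60% of the time at 50 steps, while the growing-error agent's success rate is effectively zero well before that.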

margalabargala 4 days ago | parent | prev [-]

A reasonable person [0] would not make that mistake.

[0] https://en.m.wikipedia.org/wiki/Reasonable_person

ACCount36 4 days ago | parent [-]

[flagged]

dolebirchwood 4 days ago | parent | next [-]

If nothing else, you're certainly making your case stronger with each successive comment.

margalabargala 4 days ago | parent | prev [-]

No but I've read about them in books.

bugbuddy 4 days ago | parent | prev | next [-]

An LLM’s source of “knowledge” is almost purely statistical. The prompt injections create statistical noise that makes the token search a crapshoot. My guess is there are certain words and phrases that generate and amplify the statistical noise.
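
For what it's worth, here's a toy sketch of that idea (my own framing and made-up numbers, not anyone's measured logits): a small nudge to the logits of candidate tokens can flip which token the sampler favours.

    # Toy illustration: a small perturbation to token logits changes the
    # softmax distribution enough to change the most likely next token.
    import math

    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        return [e / s for e in exps]

    clean = [2.0, 1.8, 0.5]  # hypothetical logits for three candidate tokens
    noisy = [2.0, 2.3, 0.5]  # same logits after a distracting phrase nudges them
    print(softmax(clean))    # first token is most likely
    print(softmax(noisy))    # second token now wins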

throwanem 4 days ago | parent | prev [-]

I wonder if there's variation at play here in testing culture, whether spatially or temporally or both.