aleph_minus_one 5 days ago

> Think about it like a multiple-choice test. If you do not know the answer but take a wild guess, you might get lucky and be right. Leaving it blank guarantees a zero. In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say “I don’t know.”

To me, this seems to be a "US-American" way of thinking about multiple-choice tests. Other common ways to grade multiple-choice tests that I have seen are:

1. If the testee knows that exactly one of the N given choices is correct:

1.1 Give N-1 points for the correct answer, and -1 [negative one] point(s) for a wrong answer. This way, if the testee just answers the questions randomly, their expected score is 0 points.

1.2 A more brutal way if N>=3: the correct answer gives 1 point, every wrong answer gives -1 point. You should learn your lesson to only give an answer if it is [alliteration unintended :-) ] correct (if N=2, this grading is identical to 1.1).

2. If there are possibly multiple correct answers, turn each item into a choice of "yes" or "no" (with the option to give no answer). The correct choice gives you 1 point, the wrong one gives you -1 point (i.e. as in 1.1).
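The zero-expected-value property of scheme 1.1 (and the negative expectation of scheme 1.2) is easy to verify with a small sketch; the function name here is my own, not from any grading standard:

```python
def expected_score(n_choices, points_correct, points_wrong):
    """Expected score when the testee guesses uniformly at random."""
    p_correct = 1 / n_choices
    return p_correct * points_correct + (1 - p_correct) * points_wrong

# Scheme 1.1 with N=4: N-1 = 3 points for correct, -1 for wrong
print(expected_score(4, 3, -1))   # 0.0 -- random guessing gains nothing
# Scheme 1.2 with N=4: +1 for correct, -1 for wrong
print(expected_score(4, 1, -1))   # -0.5 -- random guessing actively hurts
```

With N=2, scheme 1.2 reduces to scheme 1.1, since N-1 = 1 in that case.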

roxolotl 5 days ago | parent | next [-]

The SAT, the American college entrance exam, used to take away points for wrong answers and give 0 points for no answer (I haven't looked in years, so maybe it still does). I'm pretty sure it was +1 for a right answer, 0 for no answer, and -1/4 for a wrong answer.

thaumasiotes 5 days ago | parent [-]

They used to do that, but then they stopped and announced that you were better off guessing because there would be no adjustment for it.

A lot of what they do is based on public relations rather than psychometric validity.

CGMthrowaway 4 days ago | parent | prev | next [-]

>> Think about it like a multiple-choice test. If you do not know the answer but take a wild guess, you might get lucky and be right. Leaving it blank guarantees a zero. In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say “I don’t know.”

For TIMED multiple-choice tests (and the timed constraint makes sense in the OP's analogy as well), probabilistic answering is the secret weapon that lets smart people do well on SATs and IQ tests and other things like that.

I took an IQ test recently and it all came rushing back to me.

For math problems, often the right answer can be found just by inspecting the ones digit of the possible answers and using process of elimination. For others, by anticipating what errors the test writer expects you to make, and eliminating those as possible answers. It's like magic. Sure, you could actually sit and SOLVE each problem, but why spend the time, when time is valuable?

Pretty sure these types of strategies are not actively taught to anyone unless you have a good college counselor / interested teacher / SAT tutor. But perhaps they ought to be.

mock-possum 3 days ago | parent [-]

Yeah when you realize that the fake answers to the test have been created by humans, you can predict what false answers might look like - you can even get a feel for the kind of false answers that the test’s author tends to provide, and by the end of the test you can start to spot them fairly confidently. You still check your work, obviously, but it’s like picking up on a poker player’s tell - it gives you an edge.

bananaflag 5 days ago | parent | prev [-]

This is mentioned in the text:

> This idea is not new. Some standardized tests have long used versions of negative marking for wrong answers or partial credit for leaving questions blank to discourage blind guessing.

throwawaymaths 5 days ago | parent [-]

there's not really an easy way to train for that at scale. a "correct" answer may not be one token, there may be multiple synonymous answers starting with different tokens, and you could add five space tokens in front of the answer and it likely shouldn't make it "wrong".
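The brittleness being described can be made concrete with a hypothetical lenient verifier that grades the full answer string rather than individual tokens (function name, normalization rules, and accepted synonyms below are illustrative assumptions, not any vendor's actual grader):

```python
def is_correct(completion, accepted_answers):
    """Hypothetical lenient verifier: strip whitespace, lowercase, and
    accept any of several synonymous phrasings of the answer."""
    normalized = completion.strip().lower()
    return any(normalized == a.strip().lower() for a in accepted_answers)

print(is_correct("     42", ["42", "forty-two"]))    # True: leading spaces ignored
print(is_correct("Forty-Two", ["42", "forty-two"]))  # True: synonymous phrasing
print(is_correct("43", ["42", "forty-two"]))         # False
```

Even this only patches surface variation; enumerating every valid phrasing of a free-form answer is exactly the part that doesn't scale.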

ACCount37 5 days ago | parent [-]

Yes, it's not nearly as easy as "just fix the evals".

But better evals are still helpful, because they reward LLM vendors for trying to do the very-hard-to-do thing. Instead of rewarding them for training an LLM that's really good at emitting 7% confidence guesses.

throwawaymaths 5 days ago | parent [-]

you're missing the point. SAT-style negative marking for random guesses, fine: you could trivially use that sort of strategy for assigning cost functions to a classifier and backpropagate. but how do you give negative weight to a wrong answer when training a transformer?

ACCount37 5 days ago | parent | next [-]

In RLVR? Quite easily.

And OpenAI has induced hallucinations in o3 with RLVR mistakes, not with a failed pre-training run. They used o4-mini as an example - similar training to o3 and similar issues.

Conversely, they have also designed a post-training system that has successfully reduced hallucinations in GPT-5.
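As a rough sketch of what RLVR-style post-training can do here: the reward is assigned to the whole sampled completion (which a policy-gradient update then propagates), so the per-token objection doesn't apply. The function below is purely illustrative; the names, abstention phrases, and penalty value are my assumptions, not OpenAI's actual scheme:

```python
def rlvr_reward(completion, correct_answer, wrong_penalty=4.0):
    """Hypothetical verifiable reward: score the full sampled answer,
    making a confident wrong answer worse than abstaining."""
    normalized = completion.strip().lower()
    if normalized in ("i don't know", "unknown"):
        return 0.0                 # abstaining is neutral
    if normalized == correct_answer.strip().lower():
        return 1.0                 # verified correct
    return -wrong_penalty          # wrong answer is actively punished

print(rlvr_reward("Paris", "paris"))         # 1.0
print(rlvr_reward("I don't know", "paris"))  # 0.0
print(rlvr_reward("Lyon", "paris"))          # -4.0
```

Under a reward like this, guessing only pays off when the model's confidence exceeds the break-even point implied by the penalty, which is the same logic as negative marking on the SAT.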

RugnirViking 5 days ago | parent | prev [-]

isn't this just related to the question "how do you train a transformer"? you give it wrong examples, and use optimization algorithms to move away from those kinds of completions

throwawaymaths 5 days ago | parent [-]

that's quite hard for the reasons i explained. might be solvable using q-learning techniques, but those are not easy in the context of transformers, iiuc