| ▲ | robrenaud 4 hours ago | |
Benchmarks need to change. Say there is a four-choice question, and your best guess is B, with about a 35% chance of being right. If you are graded on the fraction of questions answered correctly, the optimization pressure is simply to answer B. If you could get half credit for answering "I don't know", we'd have a lot more models saying that when they are not confident. | ||
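To make the arithmetic concrete, here is a rough sketch of the incentive. It assumes a correct answer scores 1, a wrong answer scores 0, and "I don't know" scores a fixed credit; the function name `best_action` and its parameters are made up for illustration and do not come from any real benchmark.

```python
def best_action(p_correct: float, abstain_credit: float) -> tuple[str, float]:
    """Pick the action with the higher expected score.

    Assumed scoring (hypothetical): correct = 1, wrong = 0,
    "I don't know" = abstain_credit.
    """
    expected_if_guessing = p_correct  # 1 * p + 0 * (1 - p)
    if abstain_credit > expected_if_guessing:
        return "abstain", abstain_credit
    return "guess", expected_if_guessing


# Accuracy-only grading: abstaining earns nothing, so guessing B always wins.
print(best_action(0.35, 0.0))  # ('guess', 0.35)

# Half credit for "I don't know": abstaining beats a 35% guess.
print(best_action(0.35, 0.5))  # ('abstain', 0.5)
```

Under this scoring, abstaining is the better move whenever confidence is below the abstain credit, so half credit only pays off for answers the model thinks are less than 50% likely to be right.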