| ▲ | buttered_toast 2 hours ago | |||||||
I think we need to reevaluate what purpose these sorts of questions serve and why they're important in regards to judging intelligence. The model getting it correct or not at any given instance isn't the point, the point is if the model ever gets it wrong we can still assume that it still has some semblance of stochasticity in its output, given that a model is essentially static once it is released. Additionally, hey don't learn post training (except for in context which I think counts as learning to some degree albeit transient), if hypothetically it answers incorrectly 1 in 50 attempts, and I explain in that 1 failed attempt why it is wrong, it will still be a 1-50 chance it gets it wrong in a new instance. This differs from humans, say for example I give an average person the "what do you put in a toaster" trick and they fall for it, I can be pretty confident that if I try that trick again 10 years later they will probably not fall for it, you can't really say that for a given model. | ||||||||
| ▲ | energy123 2 hours ago | parent [-] | |||||||
They're important but not as N=1. It's like cherry picking a single question from SimpleQA and going aha! It got it right! Meanwhile it's 8% lower score than some other model when evaluated on all questions. | ||||||||
| ||||||||