simianwords, 2 days ago:
This is a very poor article. As I understand it, they take one benchmark (in particular) that tests grade-school-level math, a benchmark that claims to test the ability to reason through math problems. They agree that the benchmark shows LLMs can solve such questions and that models are getting better. But their main point is that this doesn't prove the models are reasoning.

But so what??? It may not reason the way humans do, but it is pretty damn close. The mechanics are the same: recursively extend the prompt until it terminates in the answer. They don't like that this gets described as the model "reasoning through" the problem, but at this point it's just semantics. For me, and for most others, getting the final answer is what matters, and it largely accomplishes that task.

I also don't buy that the model couldn't reason through the problem: have you ever asked a model for its explanation? It genuinely explains how it got the solution. At this point, who the hell cares what "reasoning" means if it 1. gets me the right answer and 2. reasonably explains how it did it?
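Roughly, the loop I mean looks like this (a minimal Python sketch; `sample_next_token` is a stand-in for the real decoder, not any vendor's actual API, and the canned trace is made up):

    # The model repeatedly extends its own context until it emits a
    # stop token. A real decoder conditions on the full context so far
    # and samples the next token; this placeholder replays a canned
    # "reasoning" trace to show the shape of the loop.

    CANNED = iter(["Let's", " think", " step", " by", " step.",
                   " The", " answer", " is", " 42.", "<eos>"])

    def sample_next_token(context: str) -> str:
        return next(CANNED)

    def generate(prompt: str, max_tokens: int = 256) -> str:
        out = prompt
        for _ in range(max_tokens):
            tok = sample_next_token(out)
            if tok == "<eos>":   # the answer-terminating token
                break
            out += tok           # each step feeds everything back in
        return out

    print(generate("Q: What is 6 * 7? A:"))

The "reasoning" trace and the final answer come out of the same loop; that is the whole mechanism.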
YeGoblynQueenne, a day ago, in reply:
We care whether it's reasoning or not because the alternative is that it's guessing, and when guessing is measured on benchmarks that are supposed to measure reasoning, the results are likely to be misleading.

Why do we care if the benchmark results are misleading? The reason we have benchmarks in machine learning is that we can use results on a benchmark to predict the performance of a system in uncontrolled conditions, i.e. "in the real world". If a benchmark doesn't measure what we think it measures, it can't be used to make that kind of prediction, and then we really have no idea how good or bad a system is. Seen another way: if a benchmark is not measuring what we think it measures, all we learn from a system passing it is that it passes the benchmark.

As for "who cares, if it gets you the right answer": the question is exactly how you know it's really getting you the right answer. Maybe you can tell when you already know the answer, but what about answers you genuinely don't know? And how often does it get you the wrong answer without you realising? You can't realistically test an AI system by interacting with it as thoroughly and as rigorously as a benchmark can. That's why we care about benchmarks that measure what they're supposed to be measuring.

P.S. Another issue, of course, is that guessing is limited while reasoning is... less limited. We care about reasoning because we ideally want systems that are better than the best guessing machine.
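To make "guessing can pass a reasoning benchmark" concrete, here is a toy sketch (purely illustrative; the items and both "models" are invented, and this is not any published experiment). A lookup-table guesser aces the original items and collapses on trivially perturbed ones, while something that actually computes the answer is unaffected:

    # Toy benchmark: the same two-item test, original and with the
    # numbers nudged. Requires Python 3.9+ for removeprefix/removesuffix.

    ORIGINAL  = [("What is 17 + 25?", 42), ("What is 6 * 7?", 42)]
    PERTURBED = [("What is 18 + 25?", 43), ("What is 6 * 8?", 48)]

    MEMORISED = {q: a for q, a in ORIGINAL}   # the guesser's lookup table

    def guesser(question: str) -> int:
        return MEMORISED.get(question, 0)     # unseen item: blind guess

    def reasoner(question: str) -> int:
        # stands in for a system that actually does the arithmetic
        expr = question.removeprefix("What is ").removesuffix("?")
        return eval(expr)                     # safe only for this toy input

    def accuracy(model, items):
        return sum(model(q) == a for q, a in items) / len(items)

    for name, model in [("guesser", guesser), ("reasoner", reasoner)]:
        print(name, accuracy(model, ORIGINAL), accuracy(model, PERTURBED))

Running it prints 1.0 for both on the original items and 0.0 vs 1.0 on the perturbed ones: identical benchmark scores, completely different systems. That is the measurement problem in miniature.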