ssivark | a day ago
> In this view, if a machine performs a task as well as a human, it understands it exactly as much as a human. There's no problem of how to do understanding, only how to do tasks.

Yes, but you also gloss over what a "task" is or what a "benchmark" is (which has to do with the meaning of generalization). Suppose an AI or a human answers 7 questions correctly out of 10 on an ICPC problem set. What are we able to infer from that?

1. Is the task to answer these 10 questions well, with a uniform measure of importance?

2. Is the task to be good at competitive programming problems?

3. Is the task to be good at coding?

4. Is the task to be good at problem solving?

5. Is the task not just to be effective under a uniform measure of importance, but under an adversarial one? (i.e. you could probably figure out all kinds of competitive programming questions if you had more time, etc., but roughly without needing "exponentially more resources")

These are very different levels of abstraction, and literally the same benchmark result can be interpreted to mean very different things. And that imputation of generality is not objective unless we know the mechanism by which it happens. "Understanding" is shorthand for saying that performance generalizes at one of the higher levels of abstraction (3--5), rather than being a narrow success -- because that is what we expect of a human.
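To make the "measure of importance" point concrete, here is a toy sketch (the per-category breakdown below is invented for illustration, not taken from any real ICPC set). The same 7/10 raw score is a respectable 0.70 under a uniform measure, but collapses to 0.0 under an adversarial measure that probes the weakest category:

    # Toy example: same raw benchmark result, two different measures.
    # Categories and correctness values are made up for illustration.
    results = {
        "graph":    [1, 1, 1],     # 3/3 correct
        "dp":       [1, 1, 1, 1],  # 4/4 correct
        "geometry": [0, 0, 0],     # 0/3 correct
    }

    # Uniform measure: every question counts equally.
    answers = [a for qs in results.values() for a in qs]
    uniform_score = sum(answers) / len(answers)            # 7/10 = 0.70

    # Adversarial measure: an adversary picks questions from the weakest
    # category, so the score is the worst per-category accuracy.
    adversarial_score = min(sum(qs) / len(qs) for qs in results.values())  # 0.0

    print(uniform_score, adversarial_score)

Which of the five readings you adopt determines which of these two numbers (or something in between) counts as the "result" of the benchmark.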
simianwords | 20 hours ago | parent
How do you quantify generality? If we have a benchmark that can quantify it, and that benchmark reliably tells us that the LLM is within human levels of generalisation, then the LLM is not distinguishable from a human. While it's a good point that we need to benchmark generalisation ability, you have in fact agreed that it is not important to understand the underlying mechanics.
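For what it's worth, one rough way such a benchmark could be operationalised (a sketch under my own assumptions, not an existing benchmark) is to compare the model's accuracy drop-off on held-out task families against a human baseline's drop-off:

    # Sketch: "within human levels of generalisation" as a gap comparison.
    # All numbers below are hypothetical, for illustration only.
    def generalisation_gap(seen, heldout):
        """Accuracy on familiar task families minus accuracy on novel ones."""
        return sum(seen) / len(seen) - sum(heldout) / len(heldout)

    llm_gap   = generalisation_gap(seen=[0.90, 0.85, 0.92], heldout=[0.55, 0.40])
    human_gap = generalisation_gap(seen=[0.80, 0.75, 0.78], heldout=[0.70, 0.72])

    # One possible criterion: the LLM generalises at human level if its
    # drop-off on novel task families is no worse than the human's.
    print(llm_gap <= human_gap)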