wahnfrieden 3 days ago

It's not dismissible as a misunderstanding of tokens. LLMs also embed knowledge of spelling - that's how they fixed the strawberry issue. It's a valid criticism and evaluation.

Lerc 3 days ago | parent | next [-]

The r's in strawberry present a different level of task from what people imagine. It seems trivial to a naive observer because the answer is easily derivable from the question itself, without extra knowledge.

A more accurate analogy for humans would be to imagine that every word had a colour. You are told that there is also a sequence of different colours that corresponds to the same word as that word's colour. You are even given a book showing every combination to memorise.

You learn the colours well enough that you can read and write coherently using them.

Then comes the question of how many chocolate-browns are in teal-with-a-hint-of-red. You know that teal-with-a-hint-of-red is a fruit, and you know that its colour can also be constructed from crimson followed by Disney-blond. Now, do both of those contain chocolate-brown, or just one of them? How many in total?

It requires exercising memory to do a task that is underrepresented in the training data, because humans simply never have to do it: the answer can be read straight off the question's letter representation. Humans don't have the lookup ability that LLMs need here, but then the letter representation doesn't require that ability.
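A toy sketch of the same gap in code (the token split and "memorised" table below are invented for illustration, not a real tokenizer's output):

```python
# Character view: the answer is derivable directly from the question.
word = "strawberry"
char_count = word.count("r")
print(char_count)  # 3

# Token view: the model sees opaque chunks, not letters. To count
# letters it must recall a memorised spelling for each chunk and sum.
# (This toy vocabulary and split are invented for illustration.)
tokens = ["straw", "berry"]
spelling = {"straw": "straw", "berry": "berry"}  # the memorised "book"
token_count = sum(spelling[t].count("r") for t in tokens)
print(token_count)  # 3
```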

wahnfrieden 3 days ago | parent [-]

That’s what makes it a fair evaluation and something that requires improvement. We shouldn’t only evaluate agent skills by what is most commonly represented in training data. We expect performance from them in areas where existing training data may be deficient. You don’t need to invent an absurdity to find these cases.

Lerc 3 days ago | parent [-]

It's reasonable to test their ability to do this, and it's worth working to make it better.

The issue is that people claim the performance is representative of a human's performance in the same situation. That gives an incorrect overall estimation of ability.

azakai 3 days ago | parent | prev | next [-]

I do think this is a tool issue. Here is what the article says:

> For the multiplication task, note that agents that make external calls to a calculator tool may have ZEH = ∞. While ZEH = ∞ does have meaning, in this paper we primarily evaluate the LLM itself without external tool calls

The models can count to infinity if you give them access to tools. The production models do this.
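Concretely, the tool call amounts to something like this (the function name and wiring are hypothetical; production models route this through a tool-use API):

```python
def calculator(expression: str) -> str:
    """Hypothetical calculator tool: exact arithmetic via Python's
    arbitrary-precision integers, with builtins disabled in this sketch."""
    return str(eval(expression, {"__builtins__": {}}, {}))

# Exact products far beyond what a raw model reliably computes digit by digit:
print(calculator("123456789 * 987654321"))  # 121932631112635269
```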

Not that the paper is wrong; it is still interesting to measure the core neural network of a model. But modern models use tools.

irishcoffee 2 days ago | parent [-]

So, the tools can count then?

Humans can fly, they just need wings!

azakai 2 days ago | parent [-]

It is academically interesting what pure neural networks can do, of course. But when someone goes to Claude and tries to do something, they don't care whether it solves the problem with a neural network or a call out to Python, so long as the result is right.

More generally, the ability to use tools is a form of intelligence, just like when humans and crows do it. Being able to craft the right Python script and use the result is non-trivial.

cr125rider 3 days ago | parent | prev [-]

Seems like it’s maybe also a tool steering problem. These models should be reaching for tools to help solve factual problems; the LLM itself should stick to prose.

emp17344 3 days ago | parent | next [-]

I think this is still useful research that calls into question how “smart” these models are. If the model needs a separate tool to solve a problem, has the model really solved the problem, or just outsourced it to a harness that it’s been trained - via reinforcement learning - to call upon?

dghlsakjg 3 days ago | parent | next [-]

Does it matter if the LLM can solve the problem or if it knows to use a resource?

There’s plenty of math that I couldn’t even begin to solve without a calculator or other tool. Doesn’t mean I’m not solving math problems.

In woodworking, the advice is to let the tool do the work. Does someone using a power saw have less claim to having built something than a handsaw user? Does a CNC user not count as a woodworker because the machine is doing the part that would be hard or impossible for a human?

grey-area 3 days ago | parent [-]

It does matter because the LLM doesn’t always know when to use tools (e.g. ask it for sales projections which are similar to something in its weights) and is unable to reason about the boundaries of its knowledge.

azakai 3 days ago | parent | prev [-]

It has "outsourced" it to another component, sure, but does that matter?

What the user sees is the total behavior of the entire system, not whether the system has internal divisions and separations.

emp17344 3 days ago | parent [-]

It matters if you’re curious about whether AGI is possible. Have we really built “thinking machines”, or are these systems just elaborate harnesses that leverage the non-deterministic nature of LLMs?

azakai 3 days ago | parent | next [-]

An "elaborate harness" that can break down a problem into sub-tasks, write Python scripts for the ones it can't solve itself, and then combine the results, seems able to solve a wide range of cognitive tasks?

At least in theory.
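A minimal sketch of the "write a script, run it, use the result" step (the sandbox here is a toy; a real harness would isolate execution, and the "model output" is hard-coded to keep the example self-contained):

```python
import contextlib
import io

def run_python(script: str) -> str:
    """Toy sandbox: execute a script and capture what it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(script, {})
    return buf.getvalue().strip()

# In a real harness this script would come from the model.
script_from_model = "print(sum(range(1, 101)))"
print(run_python(script_from_model))  # 5050
```

The hard part, as noted above, is producing the right script in the first place; running it is the easy half.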

TeMPOraL 3 days ago | parent | prev [-]

What's the difference? If the "elaborate harness" consists of a mix of "classical" code and ML model invocations, at what point is it disqualified from consideration as a "thinking machine"? Best we can tell, even our brains have parts that are "dumb", interfacing with the parts that we consider "where the magic happens".

stratos123 3 days ago | parent | prev [-]

Are you still talking about this paper? No tools were allowed in it.