johnfn 7 hours ago

I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)

chromacity 6 hours ago | parent | next [-]

You're kidding, but it could be true? Many areas of mathematics are, first and foremost, incredibly esoteric and inaccessible (even to other mathematicians). For this one, the author stated that there might be 5-10 people who have ever made any effort to solve it. Further, the author believed it's a solvable problem if you're qualified and grind for a bit.

In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.

Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.

kuschku 4 hours ago | parent | next [-]

If only 5-10 people have ever tried to solve something in programming, every LLM will regurgitate your own decade-old attempt again and again, sometimes even with the exact comments you wrote back then (good to know it trained on my GitHub repos...). Meanwhile you can spend upwards of 100 million tokens in gemini-cli or claude code and still not make any progress.

It's, after all, still a remix machine: it can only interpolate between things that already exist. That's good for a lot of tasks, considering everything is a remix, but it can't do truly new ones.

stabbles 2 hours ago | parent | prev | next [-]

You're glossing over the fact that mathematics uses only one token per variable (`x = ...`), whereas software engineering best practice demands an excessive number of tokens per variable for the sake of clarity.
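A crude way to see the gap (a sketch only: character count stands in for subword-token count, since real BPE tokenizers split long identifiers into many pieces, and the identifier names are invented for illustration):

```python
# Rough illustration: subword tokenizers break long identifiers into several
# pieces, so character count is a crude stand-in for token count here.
math_style = "E = m*c**2"
code_style = "energy_joules = rest_mass_kg * speed_of_light_m_per_s ** 2"

# The descriptive-identifier version is several times longer for the same formula.
print(len(math_style), len(code_style))
```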

VierScar an hour ago | parent [-]

It's also pretty silly to equate difficulty with token count. We all know line counts don't tell you much, and it shows in their own example.

Even if you did have math-like tokenisation, refactoring a thousand lines of `X = ...` to `Y = ...` isn't a difficult problem, even though it would take at least a thousand tokens. And even if you could derive E=mc^2 in a thousand tokens, that wouldn't make the two tasks remotely comparable in difficulty.
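The rename case is easy to make concrete; a minimal sketch with a hypothetical input, using Python's `re` for a word-boundary-safe rename:

```python
import re

# A thousand lines of "X = ...": lots of output tokens, near-zero difficulty.
source = "\n".join(f"X = {i}" for i in range(1000))

# Mechanically rename the variable X to Y (\b avoids touching names like "X1").
refactored = re.sub(r"\bX\b", "Y", source)

assert refactored.count("Y = ") == 1000
```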

nextaccountic 6 hours ago | parent | prev | next [-]

That's why context management is so important. AI not only gets more expensive if you waste tokens like that, it may also perform worse.

Even as context sizes get larger, this will likely stay relevant. Especially since AI providers may jack up the price per token at any time.

gf000 3 hours ago | parent | prev | next [-]

I think it's more of a data vs intelligence thing.

They are separate dimensions. There are problems that don't require any data, just "thinking" (many parts of math sit here), and there are others where data is the significant part (e.g. some simple causality for which we have a bunch of data).

Certain problems sit in between the two (a React refactor probably lands there). So no, token count is probably not a good proxy for complexity; data-heavy problems will trivially outgrow the pure-thinking ones.

ozozozd 6 hours ago | parent | prev | next [-]

Try the refactor again tomorrow. It might have gotten easier or more difficult.

locknitpicker 4 hours ago | parent | prev | next [-]

> I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is (...)

The number of tokens required to get to an output is a function of the sequence of inputs/prompts, and how a model was trained.

You have LLMs quite capable of accomplishing complex software engineering work that still struggle to translate valid text from English into certain other languages. The translations can be improved with additional prompting, but that doesn't mean the problem is more challenging.

sublinear 7 hours ago | parent | prev [-]

You might be joking, but you're probably also not that far off from reality.

I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy, and the significance of the problems is opaque to most.

We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.

johnfn 7 hours ago | parent | next [-]

I am kind of joking, but I actually don't know where the flaw in my logic is. It's like one of those math proofs that 1 + 1 = 3.

If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens just to count the number of r's in "strawberry". You can't tell me that if I count the r's in "strawberry" 1000 times I've done the mental equivalent of solving an open math problem.
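For what it's worth, the strawberry task itself is a one-liner in any programming language (Python here):

```python
# Counting letters is trivial in code, whatever it costs an LLM in tokens.
word = "strawberry"
r_count = word.count("r")
print(r_count)  # 3
```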

throw310822 7 hours ago | parent | next [-]

You can spend countless "tokens" solving minesweeper or sudoku. This doesn't mean that you solved difficult problems: just that the solutions are very long and, while each step requires reasoning, the difficulty of that reasoning is capped.

gpm 7 hours ago | parent | prev | next [-]

Some thoughts.

1. LLMs aren't "efficient": they seem as happy to spin in circles describing trivial things repeatedly as they are to spin in circles iterating on complicated ones.

2. LLMs aren't "efficient": they use the same amount of compute for every token, but sometimes all that compute is making an interesting decision about which token comes next, and sometimes there's really only one follow-up to the phrase "and sometimes there's really only", so that compute is clearly unnecessary.

3. A (theoretical) efficient LLM still needs to emit tokens to tell its tools to do the obviously right thing, like "copy this giant file nearly verbatim, except with every `if foo` replaced with `for foo in foo`". An efficient LLM might use less compute for those trivial tokens where it isn't making meaningful decisions, but if your metric is "tokens" rather than "compute", that will never show up.

Until we get reasonably efficient LLMs that don't waste compute so freely, I don't think there's any real point in trying to estimate task complexity by how long it takes an LLM.

refulgentis 5 hours ago | parent [-]

I fear that under those constraints, the only optimal output is “42”

pinkmuffinere 7 hours ago | parent | prev [-]

This is interesting; I like the thought about "what makes something difficult". Focusing just on that, my guess is that there are significant portions of the work that we commonly miss in our evaluations:

1. Knowing how to state the problem, i.e. going from the vague "I don't like this, but I do like that" to the more specific "I desire property A". In math, a lot of open problems are already precisely stated, but then you still have to do the work of _understanding_ what the precise statement is.

2. Verifying that the proposed solution actually is a full solution.

This math problem actually illustrates both of these really well for me. I read the post, but I still couldn't do _either_ of the steps above, because there's a ton of background work involved. Even if I were very familiar with the problem space, verifying the solution requires work: manually inspecting it, writing it up in Coq, something like that. I think this is similar to the saying "it takes 10 years to become an overnight success".

famouswaffles 7 hours ago | parent | prev | next [-]

>The details about human involvement are always hazy and the significance of the problems are opaque to most.

Not really. You're just in denial and not actually interested in the details. This very post includes the transcript of the chat that produced the solution.

typs 7 hours ago | parent | prev [-]

I mean, the details are in the post. You can see the conversation history and the mathematician's survey of the problem.