themanmaran 12 hours ago

> The metric reflects the proportion of all tokens served by reasoning models, not the share of "reasoning tokens" within model outputs.

I'd be interested in a clarification on the reasoning vs non-reasoning metric.

Does this mean the reasoning total is (input + reasoning + output) tokens, or just (input + output)?

Obviously the reasoning tokens would add a ton to the overall count, so it would be interesting to see an apples-to-apples comparison with non-reasoning models.

ribosometronome 11 hours ago | parent | next [-]

As would models that are overly verbose. My experience is that Claude tends to do more than is asked for (e.g. immediately moving on to creating tests and documentation), while other models like Gemini tend to be more concise in what they do.

reeeli 12 hours ago | parent | prev [-]

I'm out of time, but "reasoning input tokens" from Fortune 5000 engineers sounds like a lobotomized LSD dream. Would you care to elaborate on how you distinguish between reasoning and non-reasoning tokens, vs. the "question on duty"?

themanmaran 11 hours ago | parent | next [-]

"reasoning" models like GPT 5 et al do a pre-generation step where they:

- Take in the user query (input tokens)

- Break that into a game plan. Ex: "Based on user query: {query} generate a plan of action." (reasoning tokens)

- Answer (output tokens)

Because the reasoning step runs in a loop until it has worked through its action plan, it frequently uses way more tokens than the input or output steps.
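
If you hit the API directly you can see this breakdown in the usage numbers. A minimal sketch, assuming the OpenAI Python SDK and a reasoning-capable model ("o3" here is just illustrative), where the hidden reasoning tokens are reported separately from the visible output:

    # Sketch only: assumes the OpenAI Python SDK, and that the usage object
    # exposes completion_tokens_details.reasoning_tokens (as with o-series models).
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="o3",  # illustrative reasoning model
        messages=[{"role": "user", "content": "Write unit tests for a small analytics API."}],
    )

    usage = response.usage
    reasoning = usage.completion_tokens_details.reasoning_tokens
    visible_output = usage.completion_tokens - reasoning

    print(f"input tokens:     {usage.prompt_tokens}")
    print(f"reasoning tokens: {reasoning}")
    print(f"output tokens:    {visible_output}")

On a non-trivial prompt the reasoning count is often a multiple of the visible output, which is why the comparison matters.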

reeeli 10 hours ago | parent [-]

that was useful, thank you.

I have sooo many issues with the naming scheme of this """AI""" industry, it's crazy!

So the LLM gets a prompt, then creates a scheme for pulling pre-weighted tokens after the user's phrasing, and the constituents of that scheme are called reasoning tokens, which only get distinguished as such because there are hundreds or even thousands of output tokens next to the hundreds or thousands of potential reasoning tokens that were (almost) equal to the reasoning tokens actually chosen, all based on the more or less adequately phrased question/prompt given ... as input ... by the user ...

IgorPartola 9 hours ago | parent [-]

You can call them planning or pre-planning if you want. But I would encourage you to play with the API version of your model of choice to see exactly what this looks like. It’s kind of like a human’s internal monologue: “got an email from my boss asking to write unit tests for the analytics API. First I have to look at the implementation to know how it actually functions, then write out what kinds of tests make sense, then implement the tests. I should write a TODO list of these steps.”

It is essentially a way to expand the prompt further. You can achieve exactly the same thing by turning off the “thinking” feature and just being more detailed and step-by-step in your prompt, but this is faster.

My guess is that the next evolution of this will be models that do an edit or review step afterward to catch whether any of the constraints were broken. But as best I can tell, a reasoning model can be approximated by doing two passes of a non-reasoning model: on the first pass you give it the user prompt with instructions that boil down to “make sense of this prompt and formulate a plan”, and on the second pass you give it the original prompt, the plan, and an instruction to carry out the original prompt using the plan.
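
A rough sketch of that two-pass idea, assuming the OpenAI Python SDK; the model name ("gpt-4o") and the wording of the instructions are just placeholders:

    # Two-pass approximation of a reasoning model using a non-reasoning model.
    # Sketch only: model name and prompt wording are illustrative.
    from openai import OpenAI

    client = OpenAI()

    def two_pass(prompt: str) -> str:
        # Pass 1: make sense of the prompt and formulate a plan.
        plan = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Make sense of this prompt and formulate a step-by-step plan:\n\n{prompt}",
            }],
        ).choices[0].message.content

        # Pass 2: original prompt + plan, with an instruction to carry it out.
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Original prompt:\n{prompt}\n\nPlan:\n{plan}\n\n"
                           "Carry out the original prompt by following the plan.",
            }],
        ).choices[0].message.content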

typs 11 hours ago | parent | prev [-]

I believe they’re just classifying all models into “reasoning models” (e.g. o3) vs “non-reasoning models” (e.g. 4o) and comparing total tokens (input tokens + hidden reasoning output tokens + shown output tokens).
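
A toy illustration of the metric as I read it (made-up numbers, not their data): each request contributes input + hidden reasoning + visible output tokens, and the metric is the share coming from models classified as reasoning models:

    # Toy numbers: share of all served tokens attributable to reasoning models.
    requests = [
        {"model": "o3", "reasoning_model": True,  "input": 500, "reasoning": 3000, "output": 400},
        {"model": "4o", "reasoning_model": False, "input": 500, "reasoning": 0,    "output": 600},
    ]

    def total(r):
        return r["input"] + r["reasoning"] + r["output"]

    all_tokens = sum(total(r) for r in requests)
    reasoning_tokens = sum(total(r) for r in requests if r["reasoning_model"])

    print(f"share from reasoning models: {reasoning_tokens / all_tokens:.0%}")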

maikakz 11 hours ago | parent [-]

that's exactly right!

DIAexitNode 9 hours ago | parent [-]

hell yeah, 109 out of 10 doors opened! 99 bonus doors! what are you talking about, man?