Seems like the way to go with smaller models is to use only the low reasoning levels, and for anything where you'd want it to reason harder, to just use a larger model.

In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher reasoning levels are never Pareto optimal unless you're at the largest model size).

▲ adam_arthur 7 hours ago | parent | next [-]

I've found that disabling reasoning entirely but adding a "reason" field to the JSON response from the LLM works significantly faster and consumes far fewer tokens for narrowly scoped prompts.

At least for Claude family models.

e.g. {
  "reason": "<Describe why you picked this result>",
  "selection": "<The number of the value you selected>"
}

I'm sure native reasoning produces more accurate results, but for my use case the quality was about the same, and the model would reason for thousands of tokens with native reasoning vs. just 1-200 with response-level reasoning.

Again, to be clear, this is for deterministic/pipeline style workflows, not agentic/coding use.
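A minimal Python sketch of the response-level "reason" pattern described above, for illustration only. The prompt wording and the helper names `build_prompt` and `parse_selection` are assumptions, not anything from the comment, and the actual model call is left out on purpose since it depends on whichever client you use.

```python
import json

# Hypothetical prompt template; the JSON shape mirrors the one in the comment.
PROMPT_TEMPLATE = """Pick the best option for the request below.
Respond with JSON only, in this shape:
{{"reason": "<Describe why you picked this result>", "selection": "<The number of the value you selected>"}}

Request: {request}
Options:
{options}
"""


def build_prompt(request, options):
    """Number the options from 1 and fill in the template."""
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    return PROMPT_TEMPLATE.format(request=request, options=numbered)


def parse_selection(raw, num_options):
    """Parse the model's JSON reply and return (selection, reason).

    Raises ValueError if the selection is not a valid option number,
    and KeyError if a required field is missing.
    """
    data = json.loads(raw)
    reason = str(data["reason"])
    selection = int(data["selection"])
    if not 1 <= selection <= num_options:
        raise ValueError(f"selection {selection} is outside 1..{num_options}")
    return selection, reason
```

Putting "reason" before "selection" in the shape means the model writes its short justification before it commits to an answer, which is the point of the trick. In a pipeline you would send `build_prompt(...)` with native reasoning disabled and feed the reply text to `parse_selection`.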

▲ docheinestages 9 hours ago | parent | prev | next [-]

My experience with low reasoning effort has been nothing but a waste of time. Claude often keeps guessing, not calling tools to ground itself, and in the end I either waste the same number of tokens or just switch to Opus on xhigh. It's been a terrible experience.

▲ mwigdahl 9 hours ago | parent | prev [-]

Not to sound like an LLM, but that seems exactly right to me. Use it as a cheaper, high-functioning task subagent and lower reasoning for a master Opus session. As long as not every portion of your task requires maximum intelligence, you should come out ahead.

▲ user43928 9 hours ago | parent [-]

Won't any input be charged uncached, and the output of the small model charged again as uncached input to the bigger model?

I don't know whether that comes out ahead compared to just staying with the better model in the first place.
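One way to reason about this question is a back-of-the-envelope cost model. The Python sketch below is a single-turn simplification under stated assumptions: the `Price` fields, the function names, and any numbers you plug in are hypothetical placeholders, not real vendor pricing, and it ignores the context being re-read on every later turn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Dollars per million tokens. Any values you pass in are placeholders."""
    uncached_input: float
    cached_input: float
    output: float


def _dollars(tokens, per_mtok):
    return tokens * per_mtok / 1_000_000


def cost_inline(large, context, task_in, task_out):
    """Large model does the task itself on top of its cached context."""
    return (
        _dollars(context, large.cached_input)
        + _dollars(task_in, large.uncached_input)
        + _dollars(task_out, large.output)
    )


def cost_delegated(large, small, context, task_in, task_out, summary):
    """Small model does the task with uncached input; the large model then
    reads only the subagent's summary as new uncached input."""
    subagent = _dollars(task_in, small.uncached_input) + _dollars(task_out, small.output)
    main = _dollars(context, large.cached_input) + _dollars(summary, large.uncached_input)
    return subagent + main
```

With made-up prices, for example `Price(10, 1, 50)` for the large model and `Price(1, 0.1, 5)` for the small one, you can compare `cost_inline(...)` against `cost_delegated(...)` for your own token counts. Whether delegation wins depends mostly on how much smaller the summary is than the task's full input and output.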

▲ mwigdahl 8 hours ago | parent [-]

It's a good question, but for multiturn conversations even cached context adds up quickly. My experience has been that spawning off subagents for defined tasks in a large overall plan generally makes me come out ahead.

I'm sure folks' mileage will vary though.

▲ noisy_boy 3 hours ago | parent [-]

I asked this question and was told that, even if it is counterintuitive, medium would be more cost efficient due to caching. I changed to medium, blew my budget, and went back to low.