Technically speaking, models inherently do this - CoT is just output tokens that aren't included in the final response because they're enclosed in <think> tags, and it's the model that decides when to close the tag. You can add a bias to make it more or less likely for a model to generate a particular token, and that's how budgets work, but it's always going to be better in the long run to let the model make that decision entirely by itself - the bias is a short-term hack to prevent overthinking when the model doesn't realize it's spinning in circles.
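
To make the bias idea concrete, here is a toy sketch of adding a per-token offset to next-token logits before the softmax. The vocabulary, logit values, and bias size are made up for illustration, and nothing here is confirmed to be how any particular vendor implements thinking budgets.

```python
import math

def softmax(logits):
    """Convert a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def apply_bias(logits, bias):
    """Add a per-token offset to the logits; tokens without a bias are unchanged."""
    return {tok: v + bias.get(tok, 0.0) for tok, v in logits.items()}

# Hypothetical next-token logits in the middle of a reasoning trace.
logits = {"</think>": 0.0, "wait": 1.0, "so": 1.0}

base = softmax(logits)
# Assumed bias value: a positive offset nudges the model toward closing the tag.
biased = softmax(apply_bias(logits, {"</think>": 3.0}))
```

With a positive bias, `biased["</think>"]` is larger than `base["</think>"]`; a negative bias would do the opposite and keep the model thinking longer.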

ai_slop_hater 2 hours ago | parent [-]

> You can add a bias to make it more or less likely for a model to generate a particular token, and that's how budgets work

Do you have a source for this? I am interested in learning more about how this works.

koverstreet 2 hours ago | parent [-]

It's how temperature/top_p/top_k work. Anthropic also just put out a paper where they were doing a much more advanced version of this, mapping out functional states within the model and steering with that.
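
For reference, a minimal sketch of what those three sampling knobs typically do to a next-token distribution. The logit values are made up, and real inference stacks differ in details such as ordering and tie handling. Note that these reshape or truncate the whole distribution, which is related to, but not the same as, adding a bias to one specific token.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature divides every logit before the softmax (lower = sharper)."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, kept):
    """Zero out every index not in `kept`, then rescale so the rest sum to 1."""
    masked = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(masked)
    return [p / total for p in masked]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, kept)

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
sharp = softmax(logits, temperature=0.5)   # more mass on the top token
flat = softmax(logits, temperature=2.0)    # mass spread more evenly
top2 = top_k_filter(softmax(logits), k=2)
nucleus = top_p_filter(softmax(logits), p=0.8)
```

In every case the final token is then sampled from the resulting distribution, one step at a time.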

ai_slop_hater 2 hours ago | parent [-]

Huh, I wonder if that's why you cannot change the temperature when thinking is enabled. Do you have a link for the paper?

koverstreet 2 hours ago | parent [-]

https://transformer-circuits.pub/2026/emotions/index.html

At the actual inference level temperature can be applied at any time - generation is token by token - but that doesn't mean the API necessarily exposes it.

ai_slop_hater 2 hours ago | parent [-]

Thanks. I was referring to the fact that Anthropic, in their API, prohibits setting temperature when thinking is enabled.