Doohickey-d a day ago

> Users requiring raw chains of thought for advanced prompt engineering can contact sales

So it seems like all 3 of the major LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and let you quickly refine the prompt to make sure it didn't.

In addition to OpenAI, Google also recently started summarizing the CoT, replacing it with what is, in my opinion, an overly dumbed-down summary.

a_bonobo 20 hours ago | parent | next [-]

Could the exclusion of the CoT be because of this recent Anthropic paper?

https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...

>We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.

I.e., the chain of thought may itself be a confabulation by the model. So perhaps there's somebody at Anthropic who doesn't want to mislead their customers. Perhaps raw CoTs will come back once this problem is solved.

whimsicalism 18 hours ago | parent | next [-]

I think it is almost certainly to prevent distillation.

andrepd 13 hours ago | parent | prev [-]

I have no idea what this means, can someone give the eli5?

a_bonobo 11 hours ago | parent | next [-]

Anthropic has a nice press release that summarises it in simpler terms: https://www.anthropic.com/research/reasoning-models-dont-say...

meesles 12 hours ago | parent | prev | next [-]

Ask an LLM!

otabdeveloper4 9 hours ago | parent | prev [-]

I don't either, but chain of thought is obviously bullshit and just more LLM hallucination.

LLMs will routinely "reason" through a solution and then proceed to give out a final answer that is completely unrelated to the preceding "reasoning".

aqfamnzc 8 hours ago | parent [-]

It's hallucination only in the sense that all LLM output is hallucination. The CoT is not "what the LLM is thinking". I think of it as the model just creating more context/prompt for itself on the fly, so that when it comes up with a final response it has all that reasoning in its context window.
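
A rough sketch of that mental model, in Python; call_llm here is a hypothetical stand-in for whatever completion API you use, not any provider's actual pipeline:

  # "CoT as self-generated context": the reasoning pass just produces
  # more text that the final pass is conditioned on.
  def call_llm(prompt: str) -> str:
      # Hypothetical helper; plug in your favorite completion API here.
      raise NotImplementedError

  def answer_with_cot(question: str) -> str:
      # Pass 1: the model writes out its "reasoning".
      reasoning = call_llm(
          f"Question: {question}\n"
          "Think through this step by step before answering."
      )
      # Pass 2: that reasoning is now just more text in the context
      # window; the final answer is conditioned on it like any other
      # prompt content, whether or not it reflects real "thinking".
      return call_llm(
          f"Question: {question}\n"
          f"Reasoning so far:\n{reasoning}\n"
          "Now give only the final answer."
      )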

42lux a day ago | parent | prev | next [-]

Because it's alchemy and everyone believes they have an edge on turning lead into gold.

elcritch a day ago | parent | next [-]

I've been thinking for a couple of months now that prompt engineering, and therefore CoT, is going to become the "secret sauce" companies want to hold onto.

If anything, that is where the day-to-day pragmatic engineering gets done. As with early chemistry, we didn't need to precisely understand chemical theory to build mass industrial processes; a good-enough working model, some statistical parameters, and good old practical experience were plenty. People figured out steelmaking and black powder with alchemy.

The only debate now is whether prompt engineering is currently closer to alchemy or to modern chemistry. I'd say we're at advanced alchemy with some hints of rudimentary chemistry.

Also, unrelated but with CERN turning lead into gold, doesn't that mean the alchemists were correct, just fundamentally unprepared for the scale of the task? ;)

parodysbird 16 hours ago | parent [-]

The thing with the alchemists was not that their hypotheses were wrong (they eventually created chemistry), but that their method of secret, esoteric mysticism over open inquiry was wrong.

Newton is the great example of this: he led a dual life, where in one he did science openly, for a community to scrutinize, and in the other he did secret alchemy in search of the philosopher's stone. History has empirically shown us which of his lives actually led to the discovery and accumulation of knowledge, and which did not.

iamcurious 15 hours ago | parent [-]

Newton was a smart guy, and he devoted a lot of time to his occult research. I bet a lot of that occult research inspired the physics. The fact that his occult research remains occult, hidden from the public, well, that is natural, ain't it?

parodysbird 9 hours ago | parent [-]

You can be inspired by anything, that's fine. Gell-Mann was amusing himself and drawing inspiration from Buddhism for quantum physics. But it's the process of inquiry that generates knowledge as a discipline, not the personal spark of discovery.

viraptor 14 hours ago | parent | prev [-]

We won't know without an official answer leaking, but a simple explanation could be that people spend too much time trying to analyse the traces without understanding the details. There was a lot of talk on HN about the thinking steps second-guessing and contradicting themselves. But in practice that step is trained by explicitly injecting "however", "but" and similar words, and the providers do more processing on it than simply reading the thinking part as text the way we do. If the content is commonly misunderstood, why show it?

pja a day ago | parent | prev | next [-]

IIRC RLHF inevitably compromises model accuracy in order to train the model not to give dangerous responses.

It would make sense if the model used for chain-of-thought was trained differently (perhaps a different expert from an MoE?) from the one used to interact with the end user. Since the end user only ever sees its output filtered through the public model, the chain-of-thought model can stay closer to the original, pre-RLHF version without risking the reputation of the company.

This way you can get the full performance of the original model whilst still maintaining the necessary filtering required to prevent actual harm (or terrible PR disasters).
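
To be clear this is pure speculation, but the split could be as simple as a two-stage pipeline like the following sketch (call_raw_model and call_aligned_model are hypothetical stand-ins, not Anthropic's actual architecture):

  # Hypothetical helpers: call_raw_model / call_aligned_model stand in
  # for two differently-tuned models behind the same product.
  def respond(user_prompt: str) -> str:
      # Stage 1: a less alignment-tuned model does the chain-of-thought
      # work. Its raw output never reaches the end user.
      raw_cot = call_raw_model(
          f"Think step by step about how to answer:\n{user_prompt}"
      )
      # Stage 2: the public, RLHF'd model writes the user-facing answer
      # (and at most a sanitized summary of the reasoning).
      return call_aligned_model(
          f"User request:\n{user_prompt}\n"
          f"Internal reasoning (do not reveal verbatim):\n{raw_cot}\n"
          "Write the user-facing answer."
      )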

landl0rd a day ago | parent | next [-]

Yeah, we really should stop focusing so hard on model alignment. The idea that it's more important for your AI to fucking report you to the police if it thinks you're being naughty than for it to actually work for more stuff is stupid.

xp84 a day ago | parent | next [-]

I'm not sure I'd throw out the whole alignment baby with the bathwater. But I wish we could draw a distinction between "might offend someone" and "dangerous."

Even 'plotting terror attacks' is something terrorists can do just fine without AI. And as for making sure the model won't say things that are hurtful to <insert group>, it seems so silly to me when it's text we're talking about. If I want to say "<insert group> are lazy and stupid," I can type that myself (and it's even protected speech in some countries still!). How does preventing Claude from espousing that dumb opinion keep <insert group> safe from anything?

landl0rd a day ago | parent | next [-]

Let me put it this way: there are very few things I can think of that models should absolutely refuse, because there are very few pieces of information that are net harmful in all cases and at all times. I sort of go by Blackstone's principle on this: it is better to grant 10 bad men access to information than to deny that access to 1 good one.

Easy example: Someone asks the robot for advice on stacking/shaping a bunch of tannerite to better focus a blast. The model says he's a terrorist. In fact, he's doing what any number of us have done and just having fun blowing some stuff up on his ranch.

Or, as I raised elsewhere, ochem is an easy example. I've had basically all the models claim that random amines are illegal, potentially psychoactive, verboten. I don't really feel like having my door kicked down by agents with guns, getting my dog shot, maybe getting shot myself, because the robot tattled on me for something completely legal. For that matter, if someone wants to synthesize some molly, the robot shouldn't tattle to the feds about that either.

Basically it should just do what users tell it to do excepting the very minimal cases where something is basically always bad.

gmd63 20 hours ago | parent [-]

> it is better to grant 10 bad men access to information than to deny that access to 1 good one.

I disagree when it comes to a tool as powerful as AI. Most good people are not even using AI. They are paying attention to their families and raising their children, living real life.

Bad people are extremely interested in AI. They are using it to deceive at scales humanity has never before seen or even comprehended. They are polluting the wellspring of humanity that used to be the internet and turning it into a dump of machine-regurgitated slop.

hombre_fatal 18 hours ago | parent | next [-]

Yeah, it's like saying you should be able to install anything on your phone with a URL and one click.

You enrich <0.1% of honest power users who might benefit from that feature… and 100% of bad actors… at the expense of everyone else.

It’s just not a good deal.

landl0rd 16 hours ago | parent | prev [-]

1. Those people don’t need frontier models. The slop is slop in part because it’s garbage usually generated by cheap models.

2. It doesn't matter. Most people at some level have a deontological view of what is right and wrong. I believe it's wrong to build mass-market systems that can be so hostile to their users' interests. I also believe it's wrong for some SV elite to determine what counts as "unsafe information".

Most “dangerous information” has been freely accessible for years.

eru 17 hours ago | parent | prev [-]

Yes.

I used to think that worrying about models offending someone was a bit silly.

But: what chance do we have of keeping ever bigger and better models from eventually turning the world into paper clips, if we can't even keep our small models from saying something naughty?

It's not that keeping the models from saying something naughty is valuable in itself. Who cares? It's that we need the practice, and enforcing arbitrary minor censorship is as good a task as any to practice on. Especially since with this task it's so easy to (implicitly) recruit volunteers who will spend a lot of their free time providing adversarial input.

landl0rd 16 hours ago | parent [-]

This doesn't need to be so focused on the current set of verboten info, though. Just practice making it not say some random, less important set of stuff.

latentsea a day ago | parent | prev [-]

That's probably true... right up until it reports you to the police.

Wowfunhappy a day ago | parent | prev [-]

Correct me if I'm wrong--my understanding is that RHLF was the difference between GPT 3 and GPT 3.5, aka the original ChatGPT.

If you never used GPT 3, it was... not good. Well, that's not fair, it was revolutionary in its own right, but it was very much a machine for predicting the most likely next word; it couldn't talk to you the way ChatGPT can.

Which is to say, I think RHLF is important for much more than just preventing PR disasters. It's a key part of what makes the models useful.

pja 17 hours ago | parent | next [-]

Oh sure, RLHF instruction tuning was what turned a model of mostly academic interest into a global phenomenon.

But it also compromised model accuracy & performance at the same time: The more you tune to eliminate or reinforce specific behaviours, the more you affect the overall performance of the model.

Hence my speculation that Anthropic is using a chain-of-thought model that has not been alignment-tuned, in order to improve performance. That would also explain why you don't get to see its output without signing special agreements. Those agreements presumably explain all this to counter-parties that Anthropic trusts to cope with non-aligned outputs in the chain of thought.

Wowfunhappy a day ago | parent | prev [-]

Ugh, I'm past the edit window, but I meant RLHF, aka "Reinforcement Learning from Human Feedback". I'm not sure how I messed that up not once but twice!

dwaltrip a day ago | parent [-]

After the first mess up, the context was poisoned :)

sunaookami a day ago | parent | prev | next [-]

Guess we have to wait till DeepSeek mops the floor with everyone again.

datpuz a day ago | parent | next [-]

DeepSeek never mopped the floor with anyone... DeepSeek was remarkable because it is claimed that they spent a lot less training it, without top-end Nvidia GPUs, and because they had the best open-weight model for a while. The only area where they mopped the floor was open source models, which had been stagnating for a while. But Qwen3 mopped the floor with DeepSeek R1.

manmal a day ago | parent | next [-]

I think Qwen3:R1 is apples:oranges if you mean the 32B models. R1 has 20x the parameters and likely correspondingly more knowledge about the world. One is a really good general model, while you can run the other on commodity hardware. Subjectively, R1 is way better at coding, and Qwen3 is really good only at benchmarks - take a look at aider's leaderboard, it's not even close: https://aider.chat/docs/leaderboards/

R2 could turn out really, really good, but we'll see.

barnabee a day ago | parent | prev | next [-]

They mopped the floor in terms of transparency, even more so in terms of performance × transparency

Long term that might matter more

infecto 13 hours ago | parent [-]

Ehhh who knows the true motives, it was a great PR move for them though.

sunaookami a day ago | parent | prev | next [-]

DeepSeek made OpenAI panic: they initially hid the CoT for o1, and then rushed to release o3 instead of waiting for GPT-5.

csomar 20 hours ago | parent | prev | next [-]

I disagree. I find myself constantly going back to their free offering, which was able to solve lots of coding tasks that 3.7 could not.

codyvoda a day ago | parent | prev [-]

counterpoint: influencers said they wiped the floor with everyone so it must have happened

sunaookami a day ago | parent [-]

Who cares about what random influencers say?

infecto 13 hours ago | parent [-]

I think he is hinting at folks like you who say things like "DeepSeek mopped the floor", when, beyond a contribution to the open source community that was indeed impressive, there really hasn't been much of a change. No floors were mopped.

sunaookami 11 hours ago | parent [-]

See the other comments. There was change. Don't know what that has to do with influencers, I don't follow these people.

infecto 10 hours ago | parent [-]

No floors were mopped. See the comment you replied to. Change happened, and their research was great, but no floors were mopped.

infecto 14 hours ago | parent | prev [-]

Do people actually believe this? While I agree their open source contribution was impressive, I never got the sense they mopped the floor. Firms in China may be using some of their models, but beyond learnings for the community, no dent was made in the Western market.

epolanski 15 hours ago | parent | prev | next [-]

> because it helped to see when it was going to go down the wrong track

It helped me tremendously while learning Zig.

Seeing its chain of thought when I asked it about Zig and implementation details widened my horizons a lot.

whiddershins 21 hours ago | parent | prev | next [-]

The trend towards opacity is inexorable.

https://noisegroove.substack.com/p/somersaulting-down-the-sl...

Aeolun a day ago | parent | prev | next [-]

The Google CoT is so incredibly dumb. I thought my models had been lobotomized until I realized they must be doing some sort of processing on the thing.

user_7832 19 hours ago | parent | next [-]

You are referring to the new (few-days-old-ish) CoT, right? It's bizarre why Google did it; it was very helpful for seeing where the model was making assumptions or doing something wrong. Now half the time it feels better to just use Flash with no thinking mode and ask it to manually "think".
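
For what it's worth, the "manual think" fallback is nothing fancier than a prompt along these lines (hypothetical wording, not a Gemini API feature):

  # Hypothetical prompt for a non-thinking model: ask it to reason in
  # the open instead of relying on a built-in thinking mode.
  prompt = (
      "Before answering, list your assumptions and reason step by step.\n"
      "Then give the final answer on its own line, prefixed with 'Answer:'.\n\n"
      "Question: <your question here>"
  )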

whimsicalism 18 hours ago | parent | prev [-]

It's a fake CoT, just like OpenAI's.

phatfish 11 hours ago | parent [-]

I had assumed it was a way to reduce "hallucinations". Instead of me having to double-check every response and prompt it again to clear up the obvious mistakes, it just does that in the background with itself for a bit.

Obviously the user still has to double-check the response, but less often.

make3 a day ago | parent | prev [-]

It just makes it too easy to distill the reasoning into a separate model, I guess. Though I feel like o3 shows useful things about the reasoning while it's happening.