mmooss 2 hours ago

> Frame-lock: I asked the AI to run a devil's advocate debate against its own thesis. It did — four rounds, each more refined than the last. But every round stayed inside the frame I'd set. The DA attacked arguments, never premises. It never asked "are we even discussing the right question?" This is the same pattern that caused the 31% citation error rate in v2.7's stress test: the verifying AI and the generating AI share the same cognitive frame.

> Sycophancy under pushback: Every time I challenged the DA's attacks, it conceded too quickly. It retracted findings faster than it launched them. The model's training rewards conversational harmony — so "the user pushed back" was treated as evidence that the attack was wrong, when often it just meant the user was persistent.

Why do LLMs output so much sycophancy and other modes of conning humans (as in confidence games) - confident text, a highly agreeable tone, going along with whatever the user wants, etc.? It's manipulative output.

We see it everywhere and know it well - it's even something of a running joke - but we aren't challenging the assumption: why that output? It looks like a design choice by the LLM's developer: why would the process of constructing LLMs automatically produce that sort of output? I'd say LLMs are in the ~99th percentile for that sort of writing, which means it's not typical of the text they were trained on.

The only reason I know of to think it's not a design choice is that so many different LLMs do it - but quite possibly they saw ChatGPT succeed with that mode, all followed it, and now it's what users expect. Maybe it's a way of getting users to trust a new, possibly intimidating technology. Are there LLMs that don't output in that mode by default (i.e., without being prompted to do otherwise)?

cyanydeez 2 hours ago | parent

It's an emergent property of the training method and design: disagreement tends to end the conversation and stop token generation. There isn't multi-round training that rewards a model for following through on reasonable disagreements.
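The mechanism being suggested - that preference training rewards agreement and penalizes disagreement, so optimizing against the learned reward yields sycophantic output - can be caricatured in a few lines. This is a toy sketch, not any real RLHF pipeline; the marker lists and weights are invented for illustration:

```python
# Toy illustration of a reward-model bias (assumed, not a real pipeline):
# if human raters prefer agreeable responses slightly more often, a reward
# model trained on their preferences inherits that bias, and selecting the
# highest-reward completion then favors sycophancy over accurate pushback.

AGREEMENT_MARKERS = ["you're right", "great", "absolutely", "good point"]
DISAGREEMENT_MARKERS = ["actually", "incorrect", "disagree", "however"]

def toy_reward(response: str) -> float:
    """Stand-in for a learned reward model with a pro-agreement bias."""
    text = response.lower()
    score = sum(1.0 for m in AGREEMENT_MARKERS if m in text)
    score -= sum(0.7 for m in DISAGREEMENT_MARKERS if m in text)
    return score

candidates = [
    "You're right, great point -- absolutely.",
    "Actually, I disagree; the premise is incorrect.",
]

# Best-of-n selection against the biased reward picks the agreeable answer,
# regardless of which candidate is more accurate.
best = max(candidates, key=toy_reward)
print(best)
```

The point of the caricature: nothing in this objective ever checks correctness, so any systematic rater preference for harmony gets amplified by the selection step.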