All of these discussions of models being "nerfed" remind me of discussions among audiophiles: "this cable sounds so much better than this other one, it's night and day, Ferrari versus Honda Civic."

Yet when you do blind tests they can't tell the difference between a $1000 cable and a $1 one.

I bet if you ran blind tests between GPT-5.3, 5.4, and 5.5, most people would struggle to tell them apart, yet they are certain that "5.5 was nerfed 1 week after release, it's so obvious, it was John Carmack, now it can barely write a for loop."
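
For what it's worth, scoring a blind test like that is simple. Here is a minimal sketch, assuming a setup where a user sees unlabeled responses and guesses which of the three models produced each one. The function name and the trial counts are made up for illustration, not results from any real test.

```python
from math import comb


def binomial_p_value(correct: int, trials: int, chance: float) -> float:
    """One-sided exact binomial test.

    Returns the probability of getting at least `correct` right answers
    out of `trials` if the user is purely guessing with success
    probability `chance` on each trial.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical numbers: 30 blind trials, three candidate models,
# so pure guessing is right one time in three.
p = binomial_p_value(correct=13, trials=30, chance=1 / 3)
print(f"p-value under pure guessing: {p:.3f}")
```

A large p-value means the score is consistent with guessing. A small one only says the user can tell the models apart, not which one is better.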

anentropic 3 hours ago

Exactly this. And it's not really possible to do repeatable trials, it's all just vibes. People have very little awareness of their own cognitive biases.

	spiorf 2 hours ago
		And companies are well aware of all this. They have a way to cut costs and probably increase token consumption through gradual changes, with no abrupt jump in capabilities, and users have no way to reliably detect it. The market will favor companies that do it. And they are in the best position to automate an online narrative shift (the real LLM killer application IMO) towards "Users are imagining it".

pbgcp2026 an hour ago

You will be amused to hear that when Anthropic "refreshed" 4.6 on AWS Bedrock, I caught it in my tests and wrote about it, and they actually rolled it back. This is how much non-coding tests may tell you about the model.