ACCount37 5 days ago

Anthropic claims that they don't degrade models under load, and that the performance issues were the result of a system error:

https://status.anthropic.com/incidents/72f99lh1cj2c

That being said, they still have capacity issues on any day of the week that ends in Y. No clue how long that will take to resolve.

fragmede 5 days ago | parent | next [-]

> Last week, we opened an incident to investigate degraded quality in some Claude model responses. We found two separate issues that we’ve now resolved.

mh- 5 days ago | parent | prev | next [-]

Not nitpicking, but they said:

> we never intentionally degrade model quality as a result of demand or other factors

Fully giving them the benefit of the doubt, I think that still allows for a scenario like "we may [switch to quantized models|tune parameters], but our internal testing showed that these interventions didn't materially affect end user experience".

I hate to parse their words in this way, because I don't know how they could have phrased it in a way that closed the door on this concern, but all the anecdata (personal and otherwise) suggests something is happening.
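
To make that quantization scenario concrete, here's a minimal, entirely hypothetical sketch of what "internal testing showed no material effect" could look like. Model names, scores, and the tolerance are all invented; this is not a claim about Anthropic's actual process.

    # Hypothetical sketch of a "quantization didn't materially hurt quality" check.
    # Model names, scores, and the tolerance are invented for illustration.
    from statistics import mean

    # Pretend per-prompt quality scores (0-100) from some rubric or judge model.
    scores = {
        "model-fp16": [92, 88, 95, 90, 91],
        "model-int8": [91, 88, 94, 90, 90],
    }

    def materially_worse(baseline, candidate, tolerance=1.0):
        """True if the candidate's mean score drops by more than `tolerance` points."""
        return mean(scores[baseline]) - mean(scores[candidate]) > tolerance

    # The averaged delta stays inside the tolerance, so the cheaper variant "passes" --
    # even though individual users could still notice regressions on their workloads.
    print(materially_worse("model-fp16", "model-int8"))  # False -> "no material effect"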

ACCount37 5 days ago | parent | next [-]

"Anecdata" is notoriously unreliable when it comes to estimating AI performance over time.

Sure, people complain about Anthropic's AI models getting worse over time. As well as OpenAI's models getting worse over time. But guess what? If you serve them open weights models, they also complain about models getting worse over time. Same exact checkpoint, same exact settings, same exact hardware.

Relative LMArena metrics, however, are fairly consistent across time.

The takeaway is that users are not reliable LLM evaluators.

My hypothesis is that users have a "learning curve", and get better at spotting LLM mistakes over time - both overall and for a specific model checkpoint. Resulting in increasingly critical evaluations over time.
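
As a rough illustration of why arena-style relative metrics drift less than individual impressions: the ratings aggregate many blind pairwise votes, Elo-style, rather than one user's running impression. This toy sketch simplifies; LMArena's actual methodology differs in detail.

    # Toy Elo-style rating from blind pairwise votes. A simplification for
    # illustration, not LMArena's actual method.

    def expected(r_a, r_b):
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    def update(ratings, winner, loser, k=32):
        e = expected(ratings[winner], ratings[loser])
        ratings[winner] += k * (1 - e)
        ratings[loser] -= k * (1 - e)

    ratings = {"model-a": 1000.0, "model-b": 1000.0}
    votes = [("model-a", "model-b"), ("model-b", "model-a"), ("model-a", "model-b")]
    for w, l in votes:
        update(ratings, w, l)
    print(ratings)  # the relative gap reflects aggregated preferences, not one user's vibe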

ryoshu 5 days ago | parent | next [-]

Selection bias + perceptual adaptation is my experience. Selection bias happens when we play the probabilities of using an LLM and only focus on the things it does really well, because it can be really amazing. When you use a model a lot you increasingly see where it doesn't work well, and your perception shifts to focus on what doesn't work vs. what does.

Living evals can solve for the quantitative issues with infra and model updates, but I'm not sure how to deal with perceptual adaptation.
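
For the quantitative side, a living eval can be as simple as re-running a frozen prompt/check suite against the hosted model on a schedule and logging pass rates, so infra or checkpoint regressions show up as a measured drop rather than a vibe. A minimal sketch (the model call is a placeholder, not a real API):

    # Minimal "living eval" sketch: frozen test cases, run periodically, pass rate logged.
    import datetime, json

    SUITE = [
        {"prompt": "Return only the number 4.", "check": lambda out: out.strip() == "4"},
        {"prompt": "Reverse the string 'abc'.", "check": lambda out: "cba" in out},
    ]

    def call_model(prompt):
        # Placeholder for an actual API call to the model under test.
        return "4" if "number 4" in prompt else "cba"

    def run_suite():
        passed = sum(case["check"](call_model(case["prompt"])) for case in SUITE)
        record = {"ts": datetime.datetime.utcnow().isoformat(), "pass_rate": passed / len(SUITE)}
        print(json.dumps(record))  # append to a time series; alert on sustained drops

    run_suite()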

gowld 5 days ago | parent [-]

And survivor bias.

People who like the tool at first use it until they stop liking it -> "it got worse"

People who dislike the tool at first do not use it -> "it was bad"

rapind 5 days ago | parent | prev | next [-]

And yet, people's complaints about Claude Code over the past month and a bit have now been vindicated by Anthropic stating that those complaints led them to investigate and fix a bunch of issues (while still investigating potentially more issues with Opus).

> But guess what? If you serve them open weights models, they also complain about models getting worse over time.

Isn't this also anecdotal, or is there data informing this statement?

I think you could be partially right, but I don't think dismissing criticism as just a change in perspective is correct either. At least some complaints are from power users who can usually tell when something is getting objectively worse (as was the case for some of us Claude Code users recently). I'm not saying we can't fool ourselves too, but I don't think that's the most likely explanation.

yazanobeidi 5 days ago | parent | prev [-]

You’re not wrong, but I can literally see it get worse throughout the day sometimes, especially recently, coinciding with Pacific Time Zone business hours.

Quantization could be done not to deliberately make the model worse, but to increase reliability! Like Apple throttling devices - they were just trying to save your battery! After all, there are regular outages, including some pretty major ones a handful of weeks back that took e.g. Opus offline for an entire afternoon.
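
Purely as an illustration of that hypothesis (not anything Anthropic has described), a serving layer could in principle route requests to a cheaper variant when load spikes, trading a little quality for staying up. Names and thresholds below are invented:

    # Speculative sketch of load-based routing to a cheaper model variant.
    # Hypothetical names/thresholds; not a description of Anthropic's infrastructure.

    def pick_variant(gpu_utilization, queue_depth):
        if gpu_utilization > 0.9 or queue_depth > 500:
            return "model-int8"   # cheaper, slightly lower quality
        return "model-fp16"       # full-quality default

    print(pick_variant(0.95, 120))  # Pacific-hours peak -> "model-int8"
    print(pick_variant(0.40, 10))   # off-peak -> "model-fp16"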

SparkyMcUnicorn 5 days ago | parent | prev | next [-]

"or other factors" is pretty catch-all in my opinion.

> I don't know how they could have phrased it that closed the door on this concern

Agreed. A full legal document would probably be the only way to convince everyone.

j45 5 days ago | parent | prev [-]

Wording definitely could be clearer.

"Intentionally" might mean manually, or maybe the system does it on its own when it thinks it's best.

pmx 5 days ago | parent | prev | next [-]

Frankly, I don't believe their claims that they don't degrade the models. I know we see models as less intelligent as we get used to them and their novelty wears off, but I've had to entirely give up on Claude as a coding assistant because it seems to be incapable of following instructions anymore.

SparkyMcUnicorn 5 days ago | parent [-]

I'd believe a lot of other claims before believing model degradation was happening.

- They admittedly go off of "vibes" for system prompt updates[0]

- I've seen my coworkers make a lot of bad config and CLAUDE.md updates, MCP server spam, etc. and claim the model got worse. After running it with a clean slate, they retracted their claims.

[0] https://youtu.be/iF9iV4xponk?t=459

siva7 5 days ago | parent | prev [-]

Then check the news again. They already admitted that, due to bugs, model output was degraded for over a month.

ACCount37 5 days ago | parent [-]

My link IS that news.