Claude Code Daily Benchmarks for Degradation Tracking (marginlab.ai)
172 points by qwesr123 2 hours ago | 66 comments
ofirpress an hour ago | parent | next [-]

[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could be attributed to that. I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score. Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
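
As a rough illustration of how much noise 50 tasks and a single daily run leaves you with (my own back-of-the-envelope numbers, not MarginLab's methodology):

```python
import math

def ci_half_width(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence-interval half-width for a pass rate over n pass/fail trials."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

p = 0.5  # worst case for variance
for n_tasks, runs_per_day in [(50, 1), (300, 1), (300, 5), (300, 10)]:
    hw = ci_half_width(p, n_tasks * runs_per_day)
    print(f"{n_tasks} tasks x {runs_per_day} run(s)/day: roughly +/- {hw:.1%}")

# 50 tasks, one run: roughly +/- 14%. 300 tasks averaged over 10 runs: roughly +/- 2%.
```

So day-to-day swings of a few points are expected even if nothing about the model changes.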

Davidzheng 33 minutes ago | parent | next [-]

But degradation from servers being overloaded would be exactly the type of degradation this SHOULD measure, no? Unless it's only intended to measure whether they're quietly serving distilled models (which they claim not to do? idk for certain).

megabless123 21 minutes ago | parent | next [-]

noob question: why would increased demand result in decreased intelligence?

exitb 8 minutes ago | parent | next [-]

An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.
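
A hypothetical sketch of that tradeoff (not Anthropic's actual serving logic; the model names, thresholds, and budgets are made up):

```python
from dataclasses import dataclass

@dataclass
class ServingPolicy:
    model: str                # which weights to serve
    max_thinking_tokens: int  # how long the model may reason

def pick_policy(cluster_load: float) -> ServingPolicy:
    """Hypothetical knob-turning: trade answer quality for throughput instead of refusing work."""
    if cluster_load < 0.80:
        return ServingPolicy(model="big-model-fp16", max_thinking_tokens=8000)
    if cluster_load < 0.95:
        # Visible only as slightly worse answers, never as an error message.
        return ServingPolicy(model="big-model-int8", max_thinking_tokens=2000)
    # The obvious-to-the-customer alternative: shed load outright.
    raise RuntimeError("503: overloaded, please retry")
```

From the outside the first two branches look identical; only the exception is visible to the customer.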

vidarh 15 minutes ago | parent | prev | next [-]

It would happen if they quietly decide to serve up more aggressively distilled / quantised / smaller models when under load.

awestroke 9 minutes ago | parent | prev | next [-]

I've seen some issues with garbage tokens (they seemed to come from a completely different session, mentioned code I've never seen before, repeated lines over and over) during high load. I suspect Anthropic has some threading bugs or race conditions in their caching/inference code that only happen under very high load.

Wheaties466 13 minutes ago | parent | prev [-]

from what I understand this can come from the batching of requests.
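
One concrete (and generic, nothing Anthropic-specific) way batching can change outputs: how requests get grouped changes the order of floating-point reductions, and float addition isn't associative, so the "same" logits can round differently from batch to batch and occasionally flip a sampled token.

```python
# Regrouping the same terms -- as a different batch split would -- changes the rounding.
vals = [0.1, 1e10, -1e10, 0.1]
print(sum(vals))          # summed in the original order
print(sum(sorted(vals)))  # same numbers, different order, slightly different result
```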

cmrdporcupine 30 minutes ago | parent | prev [-]

I've personally witnessed large variability in behaviour even within a given session -- which makes sense, as there's nothing stopping Anthropic from shuttling your context/session across many different load-balanced servers, some of which might be quantized heavily to manage load and others not at all.

I don't know if they do this or not, but the nature of the API is such that you could absolutely load balance this way. The context sent at each request is not, I believe, "sticky" to any server.

TLDR you could get a "stupid" response and then a "smart" response within a single session because of heterogeneous quantization / model behaviour in the cluster.
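
A minimal sketch of why that kind of load balancing is possible (an illustrative toy, not Anthropic's infrastructure): since the client resends the whole conversation on every call, any replica, heavily quantized or not, can serve the next turn.

```python
import random

# Hypothetical pool of heterogeneous replicas behind one endpoint.
REPLICAS = ["replica-a (fp16)", "replica-b (int8)", "replica-c (fp16)"]

def chat_turn(messages: list[dict]) -> str:
    # Nothing pins a session to a server: the full context arrives with
    # each request, so the balancer can pick any replica it likes.
    replica = random.choice(REPLICAS)
    return f"[answer generated by {replica}]"

history = [{"role": "user", "content": "refactor this function"}]
history.append({"role": "assistant", "content": chat_turn(history)})
history.append({"role": "user", "content": "now add tests"})
history.append({"role": "assistant", "content": chat_turn(history)})  # may land on a different replica
```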

epolanski 18 minutes ago | parent [-]

I've defended Opus in recent weeks, but the degradation is tangible. It feels like it degraded by a generation, tbh.

cmrdporcupine 5 minutes ago | parent [-]

it's just extremely variable

mohsen1 an hour ago | parent | prev | next [-]

Hope you don't mind the unrelated question:

How do you pay for those SWE-bench runs?

I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.

https://mafia-arena.com

ofirpress an hour ago | parent [-]

Benchmarks can get costly to run; you can reach out to frontier model creators to try to get them to give you free credits, but usually they'll only agree to that once your benchmark is pretty popular.

Dolores12 38 minutes ago | parent | next [-]

so basically they know requests using your API key should be treated with care?

epolanski 25 minutes ago | parent | prev | next [-]

The last thing a proper benchmark should do is reveal its own API key.

mohsen1 an hour ago | parent | prev [-]

Yes, I reached out to them, but as you say it's a chicken-and-egg problem.

Thanks!

epolanski 26 minutes ago | parent | prev | next [-]

Still relevant over time.

cedws 37 minutes ago | parent | prev | next [-]

Agreed, this benchmark would be much more useful if run multiple times a day. That could reveal degradation in line with load patterns.

bredren 24 minutes ago | parent [-]

For CC, I suspect it also needs to test and label separate runs against subscription, public API, and Bedrock-served models?

It’s a terrific idea to provide this. An ~isitdownorisitjustme for LLMs would be the canary in the coal mine that could at least inform the multitude of discussion threads about suspected dips in performance (beyond HN).

What we could also use is similar stuff for Codex, and eventually Gemini.

Really, the providers themselves should be running these tests and publishing the data.

Availability status information is no longer sufficient to gauge service delivery, because the service is by nature non-deterministic.

dana321 21 minutes ago | parent | prev [-]

"Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded"

Aha, so the models do degrade under load.

antirez an hour ago | parent | prev | next [-]

Why I do not believe this shows Anthropic serves folks a worse model:

1. The percentage drop is too small, and it oscillates: it goes up and down.

2. A baseline for Sonnet 4.5 (the obvious choice for when their GPUs are busy with the next training run) should be established, to see whether Opus at some point drops to Sonnet level. This was not done, but if it were happening we would likely see a much sharper decline on certain days / periods: the graph would look dominated by a "square wave" shape.

3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing them (CC asks you for feedback about the session). B) Claude Code itself gets updated, and the exact versions of the tools the agent can use change. Part of it is also the natural variability of token sampling, which makes runs non-equivalent as well as non-deterministic (sketched below): at T>0 the model sometimes makes suboptimal decisions compared to T=0, but that is the price you pay for having some variability.
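
A toy sketch of the sampling point (made-up action logits, nothing to do with Claude's real distributions): at T=0 the same prompt always takes the same branch; at T>0 an agent occasionally picks a lower-probability action, and one such choice early in a run can cascade into a different trajectory and a different pass/fail outcome.

```python
import math, random

logits = {"run_tests": 2.0, "edit_file": 1.5, "delete_feature": 0.2}  # toy values

def sample_action(logits: dict[str, float], temperature: float) -> str:
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: deterministic
    weights = [math.exp(v / temperature) for v in logits.values()]
    return random.choices(list(logits), weights=weights)[0]

print(sample_action(logits, temperature=0))                         # always 'run_tests'
print({sample_action(logits, temperature=1.0) for _ in range(20)})  # usually 'run_tests', not always
```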

Dowwie an hour ago | parent | prev | next [-]

Simply search user prompts for curse words and then measure hostility sentiment. User hostility rises as agents fail to meet expectations.

preuceian 23 minutes ago | parent | next [-]

Maybe I'm overlooking something obvious, but how do you 'simply' scan the content of Claude users' prompts?

mrbananagrabber an hour ago | parent | prev | next [-]

I uh might be skewing that as I generally just use a lot of curse words with Claude by default

ctxc an hour ago | parent | prev | next [-]

I feel bad about it but sometimes it's so daft, I can't even xD

It's not my fault, they set high standards!

Trufa an hour ago | parent | prev | next [-]

I'm glad I'm not the only one.

smotched an hour ago | parent | prev [-]

there are many times where I just do it myself and it thinks it did well.

silverlight an hour ago | parent | prev | next [-]

There was a moment about a week ago where Claude went down for about an hour. And right after it came back up it was clear a lot of people had given up and were not using it.

It was probably 3x faster than usual. I got more done in the next hour with it than I do in half a day usually. It was definitely a bit of a glimpse into a potential future of “what if these things weren’t resource constrained and could just fly”.

yoavsha1 an hour ago | parent | next [-]

I had that exact same feeling during the US holidays where I got to enjoy 2x usage limits and everything just seemed to work well

cmrdporcupine 28 minutes ago | parent [-]

I had terrible results during the holidays -- it wasn't slow, but it was clear to me they were dealing with the load by quantizing in spots: there were entire chunks of days when the results were so terrible that I gave up and switched to using Gemini or Codex via opencode.

svdr 23 minutes ago | parent | prev [-]

I would also regret it if they become that fast; right now I can really take a moment to enjoy the hard work the model is doing for me.

dajonker an hour ago | parent | prev | next [-]

Wouldn't be surprised if they slowly start quantizing their models over time. Makes it easier to scale and reduce operational cost. Also makes a new release have more impact as it will be more notably "better" than what you've been using the past couple of days/weeks.
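
For anyone wondering what "quantizing" actually buys them, a toy sketch (naive symmetric int8 rounding of made-up weights, far cruder than anything a production inference stack would use):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=4096).astype(np.float32)  # fake layer weights

# Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print(f"mean absolute rounding error: {np.abs(weights - dequantized).mean():.2e}")
# Each weight now takes 1 byte instead of 2-4, so the model is cheaper to serve,
# at the cost of many tiny rounding errors like the one printed above.
```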

rustyhancock 19 minutes ago | parent | next [-]

Oooff yes I think that is exactly the kind of shenanigans they might pull.

Ultimately I can understand that if a new model is coming in without as much optimization, it adds pressure on the older models to deliver the same result more cheaply.

Nice plausible deniability for a convenient double effect.

YetAnotherNick 34 minutes ago | parent | prev [-]

Benchmarks like ARC-AGI are strongly price-correlated and cheap to run. I think it's very easy to prove that the models are degrading.

jampa 17 minutes ago | parent | prev | next [-]

I am using API mode, and it's clear that there are times when the Claude model just gives up. And it is very noticeable because the model just does the most dumb things possible.

"You have a bug in line 23." "Oh yes, this solution is bugged, let me delete the whole feature." That one-line fix I could make even with ChatGPT 3.5 can't just happen. Workflows that I use and are very reproducible start to flake and then fail.

After a certain number of tokens per day, it becomes unusable. I like Claude, but I don't understand why they would do this.

arcanemachiner 6 minutes ago | parent [-]

Robbing Peter to pay Paul. They are probably resource-constrained, and have determined that it's better to supply a worse answer to more people than to supply a good answer to some while refusing others. Especially knowing that most people probably don't need the best answer 100% of the time.

stared 19 minutes ago | parent | prev | next [-]

Does it benchmark the underlying model (Opus 4.5) or the Claude Code harness? If the latter, I would love to see CC versions included.

I would be curious to see how it fares against a constant harness.

There were threads claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157

So it would be wonderful to measure these.

Jcampuzano2 14 minutes ago | parent [-]

Claude Code. They mention they are using Claude Code's CLI in the benchmark, and Claude Code changes constantly.

I wouldn't be surprised if what this is actually benchmarking is just Claude Code's constant system prompt changes.

I wouldn't really trust this to benchmark Opus itself.

WhitneyLand 16 minutes ago | parent | prev | next [-]

First off, this is a cool project, look forward to some interesting insights.

I would suggest adding a clarification noting that longer measures like the 30-day pass rate are raw data only, while the "statistically significant" labels apply only to changes.

Maybe something like: "Includes all trials; significance labels apply only to confidence in a change vs. baseline."

qwesr123 2 hours ago | parent | prev | next [-]

FYI the MarginLab Claude Code degradation tracker is showing a statistically significant ~4% drop in SWE-Bench-Pro accuracy over the past month

goldenarm an hour ago | parent | prev | next [-]

I really like the idea, but a "±14.0% significance threshold" is meaningless here.

The larger monthly scale should be the default, or you should get more samples.

zacmps an hour ago | parent [-]

Could you elaborate on what you think the problems are? I guess they should be using some form of multiple comparison correction?

goldenarm an hour ago | parent [-]

The daily scale is not statistically significant and is meaningless. You should narrow the confidence interval by either increasing the time scale or the number of evaluations.
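
On the multiple-comparison question above: even with a correctly computed per-day interval, running a fresh 95% test every day practically guarantees occasional false alarms. A quick toy simulation (50 tasks/day, a constant true pass rate I made up, and the site's ~±14% threshold):

```python
import random

TRUE_PASS_RATE, N_TASKS, DAYS, HALF_WIDTH = 0.65, 50, 30, 0.14

def false_alarms_in_a_month() -> int:
    """Days flagged as a 'significant' change even though the model never changed."""
    flagged = 0
    for _ in range(DAYS):
        passes = sum(random.random() < TRUE_PASS_RATE for _ in range(N_TASKS))
        if abs(passes / N_TASKS - TRUE_PASS_RATE) > HALF_WIDTH:
            flagged += 1
    return flagged

months = [false_alarms_in_a_month() for _ in range(1000)]
print(sum(months) / len(months), "falsely flagged days per month, on average")
```

That comes out to roughly one falsely flagged day per month from a perfectly stable model, which is why the weekly/monthly aggregates are the ones worth watching.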

ghm2199 an hour ago | parent | prev | next [-]

In medicine there is a concept of reporting adverse effects of medications or interventions, which are then collectively studied for public health [MedWatch][VAERS][EudraVigilance] and in academia. We should have something like that for all coding agents (and agents in other fields too), given how widely they are deployed and how they affect "health" in general (not only human). Call it the AI "health of things" benchmark.

I would imagine something with the hybrid qualities of volunteer efforts like Wikipedia, new problems like Advent of Code, and benchmarks like this. The goal? To study, as a collective effort, the effects of usage across the many areas where AI is used.

[MedWatch](https://www.fda.gov/safety/medwatch-fda-safety-information-a...)

[VAERS](https://www.cdc.gov/vaccine-safety-systems/vaers/index.html)

[EudraVigilance](https://www.ema.europa.eu/en/human-regulatory-overview/resea...)

taf2 12 minutes ago | parent | prev | next [-]

Any chance we can get something like this for the Codex CLI? That'd be cool to compare too.

beardsciences an hour ago | parent | prev | next [-]

Very interesting. I would be curious to understand how granularly these updates are applied to CC, and what might be causing things like this. I feel like I can notice a very small degradation, but I have compensated with more detailed prompts (which I think, perhaps naively, is offsetting the issue).

sciencejerk an hour ago | parent | prev | next [-]

Why is this happening?

Trufa an hour ago | parent | next [-]

I have absolutely no inside knowledge, but I think it's not a bad assumption: it's costly to run the models, so when they release a new model they eat that cost and give each user more raw power; once they've captured the new users and the wow factor, they start reducing costs by reducing the capacity they provide to users. Rinse and repeat.

observationist 16 minutes ago | parent | prev | next [-]

They're "optimizing" costs wherever possible - reducing compute allocations, quantizing models, doing whatever they can to reduce the cost per token, but vehemently insisting that no such things are occurring, that it's all in the users' heads, and using the weaseliest of corporate weasel speak to explain what's happening. They insist it's not happening, then they say something like "oh, it happened but it was an accident", then they say "yes, it's happening, but it's actually good!" and "we serve the same model day by day, and we've always been at war with Eastasia."

They should be transparent and tell customers that they're trying to not lose money, but that'd entail telling people why they're paying for service they're not getting. I suspect it's probably not legal to do a bait and switch like that, but this is pretty novel legal territory.

Uehreka an hour ago | parent | prev | next [-]

There are frequently claims that Anthropic is somehow diluting or dumbing down models in some subtle way. Unfortunately it’s tough to validate these claims without a body of regularly checked evals. This test set should hopefully help settle whether Anthropic is actually making changes under the hood or whether the changes are all in people’s heads.

giwook an hour ago | parent | prev [-]

https://www.anthropic.com/engineering/a-postmortem-of-three-...

observationist 14 minutes ago | parent [-]

>>> We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone.

Just ignore the continual degradation of service day over day, long after the "infrastructure bugs" have reportedly been solved.

Oh, and I've got a bridge in Brooklyn to sell ya, it's a great deal!

fragebogen an hour ago | parent | prev | next [-]

Would love to see this idea expanded to every alleged SoTA model currently in production. Any speculation as to why this degradation occurs?

embedding-shape an hour ago | parent [-]

Anecdote: I don't have any proof and it's just a feeling. But in the afternoon in GMT+1, compared to the morning/midday, there seems to be a change in the quality of responses, which lines up with when the US wakes up. I consistently get (what feels like) worse responses in both Codex and Claude Code in the afternoon/night compared to morning/midday, so much so that I usually give up, then try the same prompt the next morning and get better results. But I guess it might just as well be me being more tired at night than in the morning; as I said, I haven't measured this.

jzig an hour ago | parent [-]

It’s the afternoon slump. The AI needs a cup of coffee and to doomscroll for half an hour!

embedding-shape an hour ago | parent [-]

Or a load balancing technique :) Either way, it kicks me off to do other things so maybe it isn't so bad after all.

sroerick 37 minutes ago | parent | prev | next [-]

My personal conspiracy theory is that they choose who to serve a degraded model to based on social graph analysis and sentiment analysis, maximizing for persuasion while minimizing compute.

IshKebab an hour ago | parent | prev | next [-]

> We model tests as Bernoulli random variables and compute 95% confidence intervals around daily, weekly, and monthly pass rates. Statistically significant differences in any of those time horizons are reported.

Doesn't really work like that. I'd remove the "statistically significant" labelling because it's misleading.

turnsout an hour ago | parent | prev [-]

This is probably entirely down to subtle changes to CC prompts/tools.

I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.

Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?

FfejL an hour ago | parent | next [-]

Honest, good-faith question.

Is CC getting better, or are you getting better at using it? And how do you know the difference?

I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.

rob 40 minutes ago | parent | next [-]

I agree with you, it's personally hard to tell.

For me I've noticed it getting nothing but better over the past couple months, but I've been working on my workflows and tooling.

For example, I used to use plan mode and would put everything in a single file and then ask it to implement it in a new session.

Switching to the 'superpowers' plugin, with its own skills to brainstorm, write plans, and execute plans with batches and tasks, seems to have made a big improvement and helped catch things I wouldn't have caught before. There's a "get shit done" plugin that's similar that I want to explore as well.

The code output always looks good to me for the most part, though, and I've never thought it's getting dumber or anything, so I feel like a lot of the improvements I see come from working through a skill issue on my part as I try to use everything. Obviously it doesn't help that there's a new way to do things every two weeks, though.

turnsout 40 minutes ago | parent | prev [-]

Good-faith answer: I can't be certain. But I've been using CC since its release, and Cursor before that (and actually going all the way back to GPT3 to do codegen in the Playground). After getting used to the CC workflow, the way that I use it has been pretty consistent. To be specific, I use basically the same AGENTS.md with small modifications for each project, and I live almost exclusively in Plan mode and the best model (currently Opus 4.5).

My initial prompting is boilerplate at this point, and looks like this:

(Explain overall objective / problem without jumping to a solution)

(Provide all the detail / file references / past work I can think of)

(Ask it "what questions do you have for me before we build a plan?")

And then go back and forth until we have a plan.

Compared to my work with CC six months ago, it's just much more capable, able to solve more nuanced bugs, and less likely to generate spaghetti code.

billylo an hour ago | parent | prev | next [-]

That's why benchmarks are useful. We all suffer from the shortcomings of human perception.

gpm an hour ago | parent | next [-]

Benchmarks' shortcomings are no worse... they inevitably measure something that is only close to the thing you actually care about, not the thing itself. It's entirely plausible that this decreased benchmark score is because Anthropic's initial prompting of the model was overtuned to the benchmark, and as they gain more experience with real-world use they are changing the prompt to do better at that and consequently worse at the benchmark.

billylo an hour ago | parent [-]

I wonder how best we can measure the usefulness of models going forward.

Thumbs up or down? (could be useful for trends) Usage growth from the same user over time? (as an approximation) Tone of user responses? (Don't do this... this is the wrong path... etc.)

turnsout 39 minutes ago | parent | prev [-]

Benchmarks measure what they measure. But your subjective experience also matters.

fragebogen an hour ago | parent | prev [-]

I was going to ask: are all other variables accounted for? Are we really comparing apples to apples here? Still worth doing, obviously, as it serves as a good e2e evaluation, just for curiosity's sake.