ofirpress 2 hours ago

[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could be attributed to that. I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score. Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
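
To make that concrete, here's a rough back-of-the-envelope sketch (illustrative numbers only, not real SWE-bench results) of how noisy a single 50-task pass rate is on its own, and how much averaging repeated runs helps:

    import statistics

    # Illustrative only: sampling noise of a pass-rate benchmark with n binary tasks.
    p, n = 0.65, 50
    per_run_se = (p * (1 - p) / n) ** 0.5
    print(f"one 50-task run: +/- {1.96 * per_run_se:.2f} (95% CI)")  # roughly +/- 0.13

    # Averaging repeated runs shrinks the noise by sqrt(number of runs).
    daily_scores = [0.62, 0.66, 0.58, 0.64, 0.60]  # e.g. 5 runs of the same suite (made up)
    mean = statistics.mean(daily_scores)
    stderr = statistics.stdev(daily_scores) / len(daily_scores) ** 0.5
    print(f"mean of 5 runs: {mean:.3f} +/- {1.96 * stderr:.3f} (95% CI)")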

Davidzheng 2 hours ago | parent | next [-]

but degradation from servers being overloaded would be exactly the type of degradation this SHOULD measure, no? Unless it's only intended to measure whether they're quietly serving distilled models (which they claim not to do? idk for certain)

botacode an hour ago | parent | next [-]

Load just makes LLMs behave less deterministically, and likely degrades output quality. See: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

They don't have to be malicious operators in this case. It just happens.
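
A minimal sketch of the mechanism (nothing Anthropic-specific; a toy stand-in for the kernel-level effects that post describes): floating-point addition isn't associative, so summing the same values in a different order -- which is roughly what different batch sizes and kernel tilings do -- yields slightly different results:

    import random

    random.seed(0)
    xs = [random.uniform(-1, 1) for _ in range(100_000)]

    def chunked_sum(values, chunk):
        # Sum in blocks of `chunk`, then sum the partials (mimics different tilings).
        partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
        return sum(partials)

    a = chunked_sum(xs, 128)
    b = chunked_sum(xs, 1024)
    print(a == b, abs(a - b))  # typically False, with a tiny but nonzero difference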

bgirard 30 minutes ago | parent | next [-]

> malicious

It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits it matters to me.

I care about -expected- performance when picking which model to use, not optimal benchmark performance.

altcognito 23 minutes ago | parent | prev [-]

Explain this though. The code is deterministic, even if it relies on pseudo random number generation. It doesn't just happen, someone has to make a conscious decision to force a different code path (or model) if the system is loaded.

megabless123 2 hours ago | parent | prev | next [-]

noob question: why would increased demand result in decreased intelligence?

exitb an hour ago | parent | next [-]

An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.
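
A purely hypothetical sketch of those two knobs -- not a claim about what any provider actually does:

    from dataclasses import dataclass

    @dataclass
    class ServingConfig:
        weights: str
        max_thinking_tokens: int

    FULL = ServingConfig(weights="full-precision", max_thinking_tokens=8192)
    CHEAP = ServingConfig(weights="heavily-quantized", max_thinking_tokens=1024)

    def admit(load: float) -> ServingConfig:
        if load < 0.80:
            return FULL    # normal serving
        if load < 0.95:
            return CHEAP   # quiet degradation: the user sees no error, just worse output
        raise RuntimeError("429 Too Many Requests")  # loud degradation: the user notices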

codeflo an hour ago | parent | next [-]

This is intentional? I think delivering lower quality than what was advertised and benchmarked is borderline fraud, but YMMV.

TedDallas an hour ago | parent | next [-]

Per Anthropic’s RCA, linked in the OP's post, on the September 2025 issues:

“… To state it plainly: We never reduce model quality due to demand, time of day, or server load. …”

So according to Anthropic, they are not tweaking quality settings due to demand.

rootnod3 an hour ago | parent | next [-]

And according to Google, they always delete data if requested.

And according to Meta, they always give you ALL the data they have on you when requested.

entropicdrifter 18 minutes ago | parent | next [-]

>And according to Google, they always delete data if requested.

However, the request form is on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard'.

groundzeros2015 11 minutes ago | parent | prev [-]

What would you like?

cmrdporcupine 33 minutes ago | parent | prev | next [-]

I guess I just don't know how to square that with my actual experiences then.

I've seen sporadic drops in reasoning skills that made me feel like it was January 2025, not 2026 ... inconsistent.

root_axis 2 minutes ago | parent [-]

I wouldn't doubt that these companies would deliberately degrade performance to manage load, but it's also true that humans are notoriously terrible at identifying random distributions, even with something as simple as a coin flip. It's very possible that what you view as degradation is just bad RNG.

16 minutes ago | parent | prev [-]
[deleted]
mcny an hour ago | parent | prev | next [-]

Personally, I'd rather get queued up with a longer wait time. I mean, not ridiculously long, but I'm OK waiting five minutes to get correct, or at least more correct, responses.

Sure, I'll take a cup of coffee while I wait (:

lurking_swe an hour ago | parent [-]

i’d wait any amount of time lol.

at least i would KNOW it’s overloaded and i should use a different model, try again later, or just skip AI assistance for the task altogether.

direwolf20 an hour ago | parent | prev | next [-]

They don't advertise a certain quality. You take what they have or leave it.

denysvitali an hour ago | parent | prev | next [-]

If there's no way to check, then how can you claim it's fraud? :)

chrisjj an hour ago | parent | prev | next [-]

There is no level of quality advertised, as far as I can see.

bpavuk an hour ago | parent | prev | next [-]

> I think delivering lower quality than what was advertised and benchmarked is borderline fraud

welcome to Silicon Valley, I guess. everything from Google Search to Uber is fraud. Uber is a classic example of this playbook, even.

copilot_king an hour ago | parent | prev [-]

If you aren't defrauding your customers you will be left behind in 2026

rootnod3 an hour ago | parent [-]

That number is a sliding window, isn't it?

sh3rl0ck 21 minutes ago | parent | prev [-]

I'd wager that lower tok/s vs lower quality of output would be two very different knobs to turn.

awestroke an hour ago | parent | prev | next [-]

I've seen some issues with garbage tokens during high load (they seemed to come from a completely different session, mentioned code I've never seen before, repeated lines over and over). I suspect Anthropic has some threading bugs or race conditions in their caching/inference code that only show up under very high load.

vidarh 2 hours ago | parent | prev | next [-]

It would happen if they quietly decide to serve up more aggressively distilled / quantised / smaller models when under load.

chrisjj an hour ago | parent [-]

They advertise the Opus 4.5 model. Secretly substituting a cheaper one to save costs would be fraud.

kingstnap 30 minutes ago | parent [-]

Old school Gemini used to do this. It was super obvious because midday the model would go from stupid to completely brain dead. I have a screenshot of Google's FAQ on my PC from 2024-09-13 that says this (I took it to post to discord):

> How do I know which model Gemini is using in its responses?

> We believe in using the right model for the right task. We use various models at hand for specific tasks based on what we think will provide the best experience.

Wheaties466 an hour ago | parent | prev [-]

from what I understand this can come from the batching of requests.
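
Roughly: which batch your request lands in can nudge the numerics by a rounding error, and a perfectly deterministic sampler will still pick a different token at a near-tie. Toy illustration with made-up logits:

    def greedy_pick(logits):
        # Deterministic: always the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])

    logits_batch_a = [2.000000, 1.999999, -1.0]  # near-tie between tokens 0 and 1
    logits_batch_b = [2.000000, 2.000001, -1.0]  # same prompt, ~1e-6 numeric drift

    print(greedy_pick(logits_batch_a), greedy_pick(logits_batch_b))  # 0, then 1
    # Once one token differs, everything after it is conditioned on a different
    # prefix, so the completions diverge without any code path changing.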

chrisjj an hour ago | parent [-]

So, a known bug?

cmrdporcupine 2 hours ago | parent | prev [-]

I've personally witnessed large variability in behaviour even within a given session -- which makes sense, as there's nothing stopping Anthropic from shuttling your context/session around, load-balanced across many different servers, some of which might be heavily quantized to manage load and others not at all.

I don't know if they do this or not, but the nature of the API is such that you could absolutely load-balance this way. The context sent at each point is not, I believe, "sticky" to any server.

TLDR you could get a "stupid" response and then a "smart" response within a single session because of heterogeneous quantization / model behaviour in the cluster.
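
Something like this hypothetical routing sketch would produce exactly that pattern (speculation about the architecture, not something Anthropic has confirmed):

    import random

    # Stateless API: the full context is re-sent each turn, so any replica can serve it,
    # and replicas need not be numerically identical.
    REPLICAS = [
        {"name": "replica-a", "precision": "bf16"},
        {"name": "replica-b", "precision": "bf16"},
        {"name": "replica-c", "precision": "int8"},  # quantized to absorb peak load
    ]

    def route(full_context: str) -> dict:
        return random.choice(REPLICAS)  # no session affinity

    for turn in range(4):
        print(turn, route("system prompt + conversation so far...")["name"])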

epolanski 2 hours ago | parent [-]

I've defended Opus in recent weeks, but the degradation is tangible. It feels like it degraded by a generation tbh.

cmrdporcupine an hour ago | parent [-]

it's just extremely variable

mohsen1 2 hours ago | parent | prev | next [-]

Hope you don't mind the unrelated question:

How do you pay for those SWE-bench runs?

I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.

https://mafia-arena.com

ofirpress 2 hours ago | parent [-]

Benchmarks can get costly to run- you can reach out to frontier model creators to try and get them to give you free credits, but usually they'll only agree to that once your benchmark is pretty popular.
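
For a rough sense of scale (all placeholder numbers -- not real pricing, task counts, or token usage):

    tasks_per_run = 300
    runs_per_day = 5
    tokens_per_task = 2_000_000       # agent loops burn a lot of context tokens
    usd_per_million_tokens = 3.00     # hypothetical blended rate

    daily_cost = tasks_per_run * runs_per_day * tokens_per_task / 1e6 * usd_per_million_tokens
    print(f"~${daily_cost:,.0f} per day")  # ~$9,000/day under these assumptions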

Dolores12 2 hours ago | parent | next [-]

so basically they know requests using your API key should be treated with care?

Deklomalo 2 hours ago | parent [-]

[dead]

epolanski 2 hours ago | parent | prev | next [-]

The last thing a proper benchmark should do is reveal its own API key.

sejje an hour ago | parent | next [-]

That's a good thought I hadn't had, actually.

plagiarist 33 minutes ago | parent | prev [-]

IMO it should need a third party running the LLM anyway. Otherwise the evaluated company could notice they're receiving the same requests daily and discover benchmarking that way.

mohsen1 2 hours ago | parent | prev [-]

yes I reached out to them but as you say it's a chicken-and-egg problem.

Thanks!

seunosewa an hour ago | parent | prev | next [-]

The degradation may be more significant within the day than at the same time every day.

GoatInGrey 38 minutes ago | parent [-]

Sure, but it's still useful insight to see how it performs over time. Of course, cynically, Anthropic could game the benchmark by routing this benchmark's specific prompts to an unadulterated instance of the model.

cedws 2 hours ago | parent | prev | next [-]

Agreed, this benchmark would be much more useful if run multiple times a day. That could reveal degradation in line with load patterns.

bredren 2 hours ago | parent [-]

For CC, I suspect it also needs to be testing and labeling separate runs against subscription, public API, and Bedrock-served models?

It’s a terrific idea to provide this. ~Isitdownorisitjustme for LLMs would be the parakeet in the coalmine that could at least inform the multitude of discussion threads about suspected dips in performance (beyond HN).

What we could also use is similar stuff for Codex, and eventually Gemini.

Really, the providers themselves should be running these tests and publishing the data.

Availability status alone is no longer sufficient to gauge service delivery, because the service is by nature non-deterministic.

chrisjj an hour ago | parent | prev | next [-]

> Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.

Are you suggesting result accuracy varies with server load?

epolanski 2 hours ago | parent | prev | next [-]

Still relevant over time.

rootnod3 an hour ago | parent | prev | next [-]

Sorry what?

"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?

"Oh, you just measured me at bad times each day. On only 50 different queries."

So, what does that mean? I have to pick specific times during the day for Claude to code better?

Does Claude Code have office hours basically?

copilot_king an hour ago | parent [-]

> Does Claude Code have office hours basically?

Yes. Now pay up or you will be replaced.

rootnod3 44 minutes ago | parent [-]

Verily, my vichyssoise of verbiage veers most verbose, so let me run that thing out of tokens fast.

dana321 2 hours ago | parent | prev [-]

"Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded"

Aha, so the models do degrade under load.