tropicalfruit 5 days ago

Reading all the shilling of Claude and GPT I see here, I often feel like I'm being gaslit.

I've been using the premium tiers of both for a long time, and I really feel like they've been getting worse.

Claude especially I find super frustrating and maddening: it misunderstands basic requests or takes liberties by making unrequested additions and changes.

I really have this sense of enshittification, almost as if they're no longer trying to serve my requests but to do something else instead, like I'm the victim of some kind of LLM A/B test to see how much I can tolerate, or how much mental load can be transferred back onto me.

tibbar 5 days ago | parent | next [-]

While it's possible that the LLMs are intentionally throttled to save costs, I would also keep in mind that LLMs are now being optimized for new kinds of workflows, like long-running agents making tool calls. It's not hard to imagine that improving performance on one of those benchmarks comes at a cost to some existing features.

macawfish 5 days ago | parent | prev | next [-]

I suspect it may not be that they're getting objectively _worse_ so much as that they aren't static products. They're constantly getting their prompts/context engines tweaked in ways that surely break people's familiar patterns. There really needs to be a way to cheaply and easily anchor behaviors so that people can get more consistency. Either that or we're just going to have to learn to adapt.

simonw 5 days ago | parent | prev | next [-]

Anthropic have stated on the record several times that they do not update the model weights once they have been deployed without also changing the model ID.

jjani 5 days ago | parent [-]

No, they do change deployed models.

How can I be so sure? Evals. There was a point where Sonnet 3.5 v2 would happily output 40k+ tokens in one message if asked. Then one day it started, with 99% consistency, outputting "Would you like me to continue?" after far fewer tokens than that. We'd been running the same set of evals the whole time and so could definitively confirm the change. Googling will also turn up many reports of this.
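To make "evals" concrete: the loop is nothing exotic. A minimal sketch with the Anthropic Python SDK (the prompt, run count, model ID, and the exact string checked are illustrative placeholders, not our actual suite) would look roughly like:

    # Run the same long-output prompt N times and record how the model terminates.
    # Prompt, run count, model ID, and the checked string are illustrative only.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    PROMPT = "Write an exhaustive, chapter-by-chapter summary of War and Peace."
    N_RUNS = 20

    truncated = 0
    for _ in range(N_RUNS):
        resp = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=8192,
            messages=[{"role": "user", "content": PROMPT}],
        )
        text = "".join(block.text for block in resp.content if block.type == "text")
        # Count runs where the model stops early and asks to continue.
        if "Would you like me to continue?" in text:
            truncated += 1
        print(resp.usage.output_tokens, resp.stop_reason)

    print(f"{truncated}/{N_RUNS} runs ended with a continuation prompt")

Run that against the same date-stamped model ID on two different days and compare the counts; that's the whole regression test.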

Whatever they did, in practice they lied: the API behavior of a deployed model changed.

Another one: differing performance between AWS Bedrock-hosted Sonnet and direct Anthropic API Sonnet on the same model version - not latency, but the output on the same prompt, over 100+ runs, a difference too statistically significant to be random chance.
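A rough sketch of that kind of comparison (prompt, region, sample size, and the token-count metric are illustrative; the real evals compare the actual output, not just its length):

    # Same prompt, N calls to each endpoint, then a significance test on a
    # simple output metric. Output token count is used here only because it's
    # easy to compare; content-level checks would be the real eval.
    import json
    import anthropic
    import boto3
    from scipy.stats import mannwhitneyu

    PROMPT = "Summarize the plot of Hamlet in detail."  # illustrative placeholder
    N = 100

    anthropic_client = anthropic.Anthropic()
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    def direct_output_tokens() -> int:
        resp = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[{"role": "user", "content": PROMPT}],
        )
        return resp.usage.output_tokens

    def bedrock_output_tokens() -> int:
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "messages": [{"role": "user", "content": PROMPT}],
        })
        resp = bedrock.invoke_model(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            body=body,
        )
        payload = json.loads(resp["body"].read())
        return payload["usage"]["output_tokens"]

    direct = [direct_output_tokens() for _ in range(N)]
    via_bedrock = [bedrock_output_tokens() for _ in range(N)]

    stat, p = mannwhitneyu(direct, via_bedrock)
    print(f"Mann-Whitney U p-value: {p:.4g}")  # tiny p-value = unlikely to be chance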

Don't take at face value what model providers claim.

simonw 5 days ago | parent [-]

If they are lying about changing model weights despite keeping the date-stamped model ID the same it would be a monumental lie.

Anthropic make most of their revenue from paid API usage. Their paying customers need to be able to trust them when they make clear statements about their model deprecation policy.

I'm going to choose to continue to believe them until someone shows me incontrovertible evidence that this isn't true.

saurik 5 days ago | parent | next [-]

Maybe they are not changing the model weights, but they are making constant tweaks to the system prompt (which isn't in any way better, to be extremely clear).

simonw 5 days ago | parent [-]

That affects their consumer apps but not models accessed via their API.

Unlike other providers, they do at least publish part of the system prompts - though they omit the tools section; I wish they'd publish the whole thing!

jjani 5 days ago | parent | prev [-]

That's a very roundabout way to phrase "you're completely making all of this up", which is quite disappointing tbh. Are you familiar with evals? As in automated testing using multiple runs? It's simple regression testing, just like for deterministic code. Doing multiple runs smooths out any stochastic differences, and the change I explained isn't explainable by stochasticity regardless.

There is no evidence that would satisfy you then, as it would be exactly what I showed. You'd need a time machine.

https://www.reddit.com/r/ClaudeAI/comments/1gxa76p/claude_ap...

Here's just one thread.

simonw 5 days ago | parent [-]

I don't think you're making it up, but without a lot more details I can't be convinced that your methodology was robust enough to prove what you say it shows.

There IS evidence that would satisfy me, but I'd need to see it.

I will have a high bar for that though. A Reddit thread of screenshots from nine months ago doesn't do the trick for me.

(Having looked at that thread it doesn't look like a change in model weights to me, it looks more like a temporary capacity glitch in serving them.)

jjani 5 days ago | parent [-]

This was anything but "temporary"; it's still in place. The last time we ran the evals was 2 weeks ago and the behavior is exactly the same. It can't be a "capacity glitch" either, as it actually outputs the continuation prompt as proper tokens.

It's possible that it was an internal system prompt change despite the claims of "there is no system prompt on the API", but this is in effect the same as changing the model.

> There IS evidence that would satisfy me, but I'd need to see it.

Describe what this evidence would look like. It sure feels like an appeal to authority - if I were someone with a "name", I'm sure you'd believe it.

If you'd had the same set of evals set up since then, you wouldn't have questioned this at all. You don't.

> I don't think you're making it up, but without a lot more details I can't be convinced that your methodology was robust enough to prove what you say it shows.

Go on then, poke holes in it. I've clearly explained the methodology.

TechDebtDevin 5 days ago | parent | prev [-]

If Anthropic made a Deepthink 3.5 it would be AGI; I never use anything > 3.5.