| ▲ | The LLM Lobotomy? (learn.microsoft.com) |
| 125 points by sgt3v 13 hours ago | 54 comments |
| |
|
| ▲ | esafak 12 hours ago | parent | next [-] |
This was the perfect opportunity to share the evidence. I think undisclosed quantization is definitely a thing. We need benchmarks to be re-evaluated periodically to guard against this. Providers should keep timestamped models fixed and assign modified versions a new timestamp, and a new price if they want. The model with the "latest" tag could change over time, like a Docker image. Then we could make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too. edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago, and features have been removed without notice. Qualitatively, the devices are no longer what I bought. |
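A minimal sketch of that pinning pattern, using the OpenAI Python SDK (the snapshot name and prompt are illustrative, not a claim about any provider's actual versioning scheme):

  from openai import OpenAI

  client = OpenAI()

  # Pinned: a dated snapshot that should stay frozen once published.
  PINNED_MODEL = "gpt-4o-2024-08-06"
  # Floating: an alias the provider can repoint, like a Docker "latest" tag.
  FLOATING_MODEL = "gpt-4o"

  def ask(model: str, prompt: str) -> str:
      resp = client.chat.completions.create(
          model=model,
          messages=[{"role": "user", "content": prompt}],
          temperature=0,
      )
      return resp.choices[0].message.content

  # Anything you benchmark or build a business on should reference the pinned name.
  print(ask(PINNED_MODEL, "Summarize RFC 2119 in one sentence."))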
| |
| ▲ | gregsadetsky 12 hours ago | parent | next [-] | | I commented on the forum asking Sarge whether they could share some of their test results. If they do, I think that it will add a lot to this conversation. Hope it happens! | |
| ▲ | icyfox 12 hours ago | parent | prev | next [-] | | I guarantee you the weights are already versioned the way you're describing. Each training run results in a static bundle of outputs, and these are very much pinned (OpenAI has confirmed multiple times that they don't change the model weights once they issue a public release). > "Not quantized. Weights are the same. If we did change the model, we'd release it as a new model with a new name in the API." - [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI) The problem is that most of these issues stem from broader infrastructure problems, like numerical instability at inference time. Since this affects their whole serving pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each point release, but then previous models wouldn't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so difficult as to be effectively impossible. https://www.anthropic.com/engineering/a-postmortem-of-three-...
https://thinkingmachines.ai/blog/defeating-nondeterminism-in... | | |
| ▲ | xg15 11 hours ago | parent | next [-] | | Sorry, but this makes no sense. Numerical instability would lead to random fluctuations in output quality, not to the continuous slow decline the OP described. I've heard of similar experiences from real-life acquaintances, where a prompt worked reliably for hundreds of requests per day for several months - and then, when a newer model was released, the model suddenly started making mistakes, ignoring parts of the prompt, etc. I agree it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one - it might just be that fewer resources are allocated to the older model once the newer one is available and so the inference parameters change - but some effect around the release of a newer model does seem to be there. | |
| ▲ | icyfox 11 hours ago | parent [-] | | I'm responding to the parent comment, who is suggesting we version-control the "model" in Docker. There are infra reasons why companies don't do that. Numerical instability is one class of inference issue, but there can be other bugs in the stack separate from intentionally changing the weights or switching to a quantized model. As for the original forum post: - Multiple numerical computation bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem) - OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data |
| |
| ▲ | esafak 10 hours ago | parent | prev [-] | | That's a great point. However, for practical purposes I think we can treat the serving pipeline as part and parcel of the model. So it is dishonest of companies to say they haven't changed the model while making cost optimizations that impair the model's effective intelligence. |
| |
| ▲ | colordrops 12 hours ago | parent | prev [-] | | In addition to quantization, I suspect the additions they make continually to their hidden system prompt for legal, business, and other reasons slowly degrade responses over time as well. | | |
| ▲ | jonplackett 12 hours ago | parent [-] | | This is quite similar to all the modifications Intel had to make because of Spectre - I bet those system prompts have grown exponentially. |
|
|
|
| ▲ | briga 12 hours ago | parent | prev | next [-] |
I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e., expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is; just because a model produces nine excellent responses doesn't mean the tenth response won't be garbage. |
| |
| ▲ | vintermann 12 hours ago | parent | next [-] | | They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line? Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format. | |
| ▲ | lostmsu 3 hours ago | parent | next [-] | | With T=0 on the same model you should get the same exact output text. If they are not getting it, other environmental factors invalidate the test result. | |
| ▲ | Spivak 10 hours ago | parent | prev [-] | | If it was random luck, wouldn't you expect about half the answers to be better? Assuming the OP isn't lying I don't think there's much room for luck when you get all the questions wrong on a T/F test. |
| |
| ▲ | nothrabannosir 12 hours ago | parent | prev | next [-] | | TFA is about someone running the same test suite with 0 temperature and fixed inputs and fixtures on the same model over months on end. What’s missing is the actual evidence. Which I would love of course. But assuming they’re not actively lying, this is not as subjective as you suggest. | |
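For concreteness, a drift harness like the one TFA describes can be quite small. This is only a sketch under assumptions (an OpenAI-compatible API, made-up test cases and pass criteria; the OP's actual suite is not public):

  import datetime
  import json
  from openai import OpenAI

  client = OpenAI()
  MODEL = "gpt-4o-2024-08-06"   # illustrative pinned snapshot under test

  # Fixed prompts with machine-checkable expectations (hypothetical examples).
  CASES = [
      {"prompt": "Return ONLY valid JSON: {\"ok\": true}",
       "check": lambda out: json.loads(out) == {"ok": True}},
      {"prompt": "Answer with exactly one word, yes or no: Is 7 prime?",
       "check": lambda out: out.strip().lower() == "yes"},
  ]

  def run_suite() -> float:
      passed = 0
      for case in CASES:
          resp = client.chat.completions.create(
              model=MODEL,
              messages=[{"role": "user", "content": case["prompt"]}],
              temperature=0,
          )
          out = resp.choices[0].message.content
          try:
              passed += bool(case["check"](out))
          except Exception:
              pass  # malformed output counts as a failure
      return passed / len(CASES)

  # Append one line per day; a downward trend on identical inputs is the signal.
  print(datetime.date.today().isoformat(), run_suite())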
| ▲ | chaos_emergent 12 hours ago | parent | prev | next [-] | | Yes, exactly. My theory is that the novelty of a new generation of LLMs tends to inflate people's perception of the model, with a reversion to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of model change. |
| ▲ | zzzeek 12 hours ago | parent | prev | next [-] | | your theory does not hold up for this specific article as they carefully explained they are sending identical inputs into the model each time and observing progressively worse results with other variables unchanged. (though to be fair, others have noted they provided no replication details as to how they arrived at these results.) | |
| ▲ | gtsop 10 hours ago | parent | prev | next [-] | | I see your point, but no, it's getting objectively worse. I have a similar experience from casually using ChatGPT for various use cases: when 5 dropped, I noticed it was very fast but oddly got some details off. As time moved on it became both slower and the output deteriorated. |
| ▲ | yieldcrv 12 hours ago | parent | prev | next [-] | | fta: “I am glad I have proof of this with the test system” I think they have receipts, but did not post them there | | |
| ▲ | Aurornis 12 hours ago | parent [-] | | A lot of the claims I’ve seen have claimed to have proof, but details are never shared. Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim. | | |
| ▲ | yieldcrv 11 hours ago | parent [-] | | That's been my experience too, but I use local models, sometimes the same ones for years already, and the consistency there is noteworthy, while I have doubts about the quality consistency I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now. So perhaps it's just a matter of transparency, but I think there is continual fine-tuning occurring, alongside filters added and removed in an opaque way in front of the model |
|
| |
| ▲ | colordrops 12 hours ago | parent | prev [-] | | Did any of you read the article? They have a test framework that objectively shows the model getting worse over time. | | |
| ▲ | Aurornis 12 hours ago | parent [-] | | I read the article. No proof was included. Not even a graph of declining results. |
|
|
|
| ▲ | ProjectArcturis 12 hours ago | parent | prev | next [-] |
I'm confused about why this is addressed to Azure instead of OpenAI. Isn't Azure just offering a wrapper around ChatGPT? That said, I would also love to see some examples or data, instead of just "it's getting worse". |
| |
| ▲ | SBArbeit 11 hours ago | parent | next [-] | | I know that OpenAI has made computing deals with other companies, and over time the share of inference running outside Microsoft Azure data centers will grow, but I doubt that much, if any, of it has moved yet, so that's not a reason for differences in model performance. With that said, Microsoft has a different level of responsibility than OpenAI or any other frontier provider, both to its customers and to its stakeholders, to provide safety. That's not a criticism of OpenAI or Anthropic or anyone else, who I believe are all trying their best to provide safe usage. (Well, other than xAI and Grok, for which the lack of safety is a feature, not a bug.) The risk to Microsoft of getting this wrong is simply higher than it is for other companies, and that's why they have a strong focus on Responsible AI (RAI) [1]. I don't know the details, but I have to assume there's a layer of RAI processing on models served through Azure OpenAI that's not there when using OpenAI models directly through the OpenAI API. That layer is valuable for the companies who choose to run their inference through Azure and want to maximize safety. I wonder if that's where some of the observed changes are coming from. I hope the commenter posts their proof for further inspection. It would help everyone. [1]: https://www.microsoft.com/en-us/ai/responsible-ai |
| ▲ | SubiculumCode 12 hours ago | parent | prev [-] | | I don't remember where I saw it, but I remember a claim that Azure-hosted models performed worse than those hosted by OpenAI. | |
| ▲ | transcriptase 12 hours ago | parent | next [-] | | Explains why the enterprise copilot ChatGPT wrapper that they shoehorn into every piece of office365 performs worse than a badly configured local LLM. | |
| ▲ | bongodongobob 12 hours ago | parent | prev [-] | | They most definitely do. They have been lobotomized in some way to be ultra corporate friendly. I can only use their M365 Copilot at work and it's absolute dogshit at writing code more than maybe 100 lines. It can barely write correct PowerShell. Luckily, I really only need it for quick and dirty short PS scripts. |
|
|
|
| ▲ | juliangoldsmith 8 hours ago | parent | prev | next [-] |
I've been using Azure AI Foundry for an ongoing project and have been extremely dissatisfied. The first issue I ran into was their not supporting LLaMA for tool calls. Microsoft stated in February that they were working on it [0], and they were just closing the ticket because they were tracking it internally. I'm not sure why, in over six months, they've been unable to do what took me two hours, but I am sure they wouldn't be upset by me using the much more expensive OpenAI models. There are also consistent performance issues, even on small models, as mentioned elsewhere. This is with a request rate on the order of one per minute. You can solve that with provisioned throughput units. The cheapest option is one of the GPT models, at a minimum of $10k/month (a bit under half the cost of just renting an A100 server). DeepSeek was a minimum of around $72k/month. I don't remember there being any other non-OpenAI models with a provisioned option. Given that our current usage without provisioning costs on the order of single dollars per month, I have some doubts as to whether we'd be getting our money's worth having to provision capacity. |
|
| ▲ | cush 12 hours ago | parent | prev | next [-] |
| Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal? |
| |
| ▲ | criemen 12 hours ago | parent | next [-] | | Even with temperature 0, the LLM output will not be deterministic. It will just have less randomness (loosely speaking) than with temperature 1. There was a recent post on the front page about fully deterministic sampling, but it turns out to be quite difficult. | |
| ▲ | visarga 11 hours ago | parent [-] | | It's because batch size is dynamic. So a different batch size will change the output even on temp 0. |
| |
| ▲ | jonplackett 12 hours ago | parent | prev | next [-] | | It could be that performance on temp zero has declined but performance on a normal temp is the same or better. I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle. | |
| ▲ | Spivak 10 hours ago | parent | prev | next [-] | | I don't think it's a valid measure across models but, as in the OP, it's a great measure for detecting when they mess with "the same model" behind the scenes. That being said, we also keep a test suite to check that model updates don't produce worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because they stopped being able to complete tasks (on the same data) that earlier versions could. I don't blame Anthropic; I would be crazy to assume that new models are a strict improvement across all tasks and domains. I do just wish they would stop deprecating old models; once you have something working to your satisfaction it would be nice to freeze it. Ah well, only for local models. |
| ▲ | fortyseven 12 hours ago | parent | prev [-] | | I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that? | | |
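For what it's worth, the OpenAI chat completions API does expose a seed parameter (best-effort, not a guarantee) and returns a system_fingerprint you can log; if the fingerprint changes between runs, the serving stack changed even though the model name didn't. A sketch, with an illustrative model name:

  from openai import OpenAI

  client = OpenAI()

  resp = client.chat.completions.create(
      model="gpt-4o-2024-08-06",   # illustrative pinned snapshot
      messages=[{"role": "user", "content": "Name three prime numbers."}],
      temperature=0,
      seed=12345,                  # best-effort reproducibility
  )

  # Store the fingerprint next to each eval result; a change here means the
  # backend changed even though the model name stayed the same.
  print(resp.system_fingerprint)
  print(resp.choices[0].message.content)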
|
|
| ▲ | mmh0000 10 hours ago | parent | prev | next [-] |
| I've noticed this with Claude Code recently. A few weeks ago, Claude was "amazing" in that I could feed it some context and a specification, and it could generate mostly correct code and refine it in a few prompts. Now, I can try the same things, and Claude gets it terribly wrong and works itself into problems it can't find its way out of. The cynical side of me thinks this is being done on purpose, not to save Anthropic money, but to make more money by burning tokens. |
|
| ▲ | cjtrowbridge 9 hours ago | parent | prev | next [-] |
This brings up a point many will not be aware of. If you know the random seed, the prompt, and the hash of the model's binary file, the output is completely deterministic. You can use this information to check whether they are in fact swapping your requests out to cheaper models than the one you're paying for. This level of auditability is a strong argument for using open-source, commodified models, because you can easily check whether the vendor is ripping you off. |
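A rough sketch of that audit idea for a local open-weight model, using llama-cpp-python (the file path is a placeholder). Note the replies below: exact bit-for-bit reproducibility still depends on the backend, so this is most dependable for a single-threaded CPU run:

  import hashlib
  from llama_cpp import Llama

  MODEL_PATH = "models/llama-3-8b-instruct.Q4_K_M.gguf"  # placeholder path

  # Record exactly which weights file produced the output.
  with open(MODEL_PATH, "rb") as f:
      weights_sha256 = hashlib.sha256(f.read()).hexdigest()

  # Fixed seed, single thread, to keep the run as reproducible as possible.
  llm = Llama(model_path=MODEL_PATH, seed=42, n_threads=1, verbose=False)

  out = llm("Q: What is 2 + 2?\nA:", max_tokens=8, temperature=0.0)
  print(weights_sha256[:16], repr(out["choices"][0]["text"]))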
| |
| ▲ | TZubiri 7 hours ago | parent [-] | | Pretty sure this is wrong: requests are batched, and batch size can affect the output. Also, GPUs are highly parallel, so there can be many race conditions. | |
| ▲ | TeMPOraL 31 minutes ago | parent [-] | | Yup. Floating point math turns race conditions into numerical errors, reintroducing non-determinism regardless of inputs used. |
|
|
|
| ▲ | gwynforthewyn 12 hours ago | parent | prev | next [-] |
| What's the conversation that you're looking to have here? There are fairly widespread claims that GPT-5 is worse than 4, and that's what the help article you've linked to says. I'm not sure how this furthers dialog about or understanding of LLMs, though, it reads to _me_ like this question just reinforces a notion that lots of people already agree with. What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere. |
| |
| ▲ | bn-l 11 hours ago | parent [-] | | Maybe to prompt more anecdotes on how gpt-$ is the money-making GPT, where they gut quality and hold prices steady to reduce losses? I can tell you that what the post describes is exactly what I've seen too: degraded performance, and excruciatingly slow. |
|
|
| ▲ | romperstomper 9 hours ago | parent | prev | next [-] |
Could it be the result of caching of some sort? I suppose in the case of an LLM they can't cache responses directly, but they could group similar prompts using embeddings and serve some most-common result, maybe? (This is just a theory.) |
|
| ▲ | jug 12 hours ago | parent | prev | next [-] |
| At least on OpenRouter, you can often verify what quant a provider is using for a particular model. |
|
| ▲ | bigchillin 12 hours ago | parent | prev | next [-] |
This is why we have open source.
I noticed this with Cursor; it's not just an Azure problem. |
|
| ▲ | SirensOfTitan 12 hours ago | parent | prev | next [-] |
| I’m convinced all of the major LLM providers silently quantize their models. The absolute worst was Google’s transition from Gemini 2.5 Pro 3-25 checkpoint to the May checkpoint, but I’ve noticed this effect with Claude and GPT over the years too. I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice. |
| |
| ▲ | bn-l 11 hours ago | parent [-] | | You can be clever with language also. You can say “we never intentionally degrade model performance” and then claim you had no idea a quant would make perf worse because it was meant to make it better (faster). |
|
|
| ▲ | ukFxqnLa2sBSBf6 12 hours ago | parent | prev | next [-] |
| It’s a good thing the author provided no data or examples. Otherwise, there might be something to actually talk about. |
|
| ▲ | mehdibl 11 hours ago | parent | prev | next [-] |
Since when did LLMs become deterministic? |
| |
| ▲ | thomasmg 10 hours ago | parent [-] | | LLMs are just software + data and can be made deterministic, in the same way a pseudo-random number generator can be made deterministic by using the same seed. For an LLM, you typically set the temperature to 0 or fix the random seed, run it on the same hardware (or an emulation of it), and otherwise ensure the (floating-point) calculations produce the exact same results. I think that's it. In reality, yes, it's not that easy, but it's possible. | |
| ▲ | mr_toad 9 hours ago | parent [-] | | Unfortunately, because floating-point addition isn't always associative, and because GPUs don't always perform calculations in the same order, you won't always get the same result even with a temperature of zero. |
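A quick illustration of the non-associativity point, in plain Python floats:

  # Same three numbers, two different summation orders.
  a, b, c = 1e16, -1e16, 1.0

  print((a + b) + c)   # 1.0  -- a and b cancel first, then c survives
  print(a + (b + c))   # 0.0  -- c is absorbed by rounding before a cancels b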
|
|
|
| ▲ | ant6n 11 hours ago | parent | prev | next [-] |
| I used to think running your own local model is silly because it’s slow and expensive, but the nerfing of ChatGPT and Gemini is so aggressive it’s starting to make a lot more sense. I want the smartest model, and I don’t want to second guess some potentially quantized black box. |
|
| ▲ | zzzeek 12 hours ago | parent | prev | next [-] |
| I'm sure MSFT will offer this person some upgraded API tier that somewhat improves the issues, though not terrifically, for only ten times the price. |
|
| ▲ | bbminner 10 hours ago | parent | prev [-] |
| Am I the only person who can sense the exact moment an LLM-written response kicked in? :) "sharing some of the test results/numbers you have would truly help cement this case!" - c'mon :) |
| |
| ▲ | gregsadetsky 10 hours ago | parent [-] | | I actually 100% wrote that comment myself haha!! See https://news.ycombinator.com/item?id=45316437 I think it would have sounded more reasonable in French, which is my actual native tongue. (i.e. I subconsciously translate from French when I'm writing in English) ((this comment was also written without AI!!)) :-) |
|