esafak 14 hours ago

This was the perfect opportunity to share the evidence. I think undisclosed quantization is definitely a thing. We need benchmarks to be re-evaluated periodically to guard against it.

Providers should keep timestamped models fixed and assign modified versions a new timestamp, and a new price if they want. The model with the "latest" tag could change over time, like a Docker image. Then we could make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.
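Something like this already half-exists: OpenAI, for example, exposes dated snapshot names alongside floating aliases. A minimal sketch with the `openai` Python client (the specific model names are examples and may rotate):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [{"role": "user", "content": "Explain quantization in one sentence."}]

# Floating alias: like a Docker "latest" tag, it can silently point at a
# newer snapshot tomorrow.
floating = client.chat.completions.create(model="gpt-4o", messages=messages)

# Dated snapshot: pinned to a single release, analogous to pulling an
# image by digest.
pinned = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=messages,
)

print(pinned.choices[0].message.content)
```

The catch, as the replies below get into, is that a pinned name only buys you reproducibility if everything behind it, weights and serving stack alike, is actually frozen.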

edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.

gregsadetsky 14 hours ago | parent | next [-]

I commented on the forum asking Sarge whether they could share some of their test results.

If they do, I think that it will add a lot to this conversation. Hope it happens!

icyfox 14 hours ago | parent | prev | next [-]

I guarantee you the weights are already versioned like you're describing. Each training run results in a static bundle of outputs, and these are very much pinned; OpenAI has confirmed multiple times that they don't change the model weights once they issue a public release:

> "Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API."

- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)

The problem is that most of these issues stem from the broader inference infrastructure, things like numerical instability at serving time. Since this affects the whole service pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each point release, but then previous models wouldn't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination might be so logistically hard as to be effectively impossible.
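To make the "numerical instability" point concrete, here's a toy, provider-agnostic illustration: floating-point addition is not associative, so the order of a reduction, which in real serving stacks shifts with batch size and kernel choice, changes the result even when the inputs are bit-identical.

```python
import random

random.seed(0)
vals = [random.uniform(-1e6, 1e6) for _ in range(100_000)]

# Identical numbers, summed in two different orders. In an inference
# stack the order changes with batching and kernel selection, not with
# the model weights.
forward = sum(vals)
reverse = sum(reversed(vals))

print(forward == reverse)      # typically False
print(abs(forward - reverse))  # tiny, but enough to flip a near-tied
                               # token during sampling
```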

https://www.anthropic.com/engineering/a-postmortem-of-three-...

https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

xg15 13 hours ago | parent | next [-]

Sorry, but this makes no sense. Numerical instability would lead to random fluctuations in output quality, not to the continuous slow decline the OP described.

I've heard of similar experiences from real-life acquaintances: a prompt worked reliably for hundreds of requests per day over several months, and then, when a newer model was released, the model suddenly started making mistakes, ignoring parts of the prompt, and so on.

I agree it doesn't have to be deliberate malice, like intentionally nerfing a model to make people switch to the newer one. It might just be that fewer resources are allocated to the older model once the newer one is available, so the inference parameters change. But some effect around the release of a newer model does seem to be there.

icyfox 13 hours ago | parent [-]

I'm responding to the parent comment, who's suggesting we version-control the "model" in Docker; there are infrastructure reasons why companies don't do that. Numerical instability is one class of inference issue, but there can be other bugs in the stack that have nothing to do with intentionally changing the weights or switching to a quantized model.

As for the original forum post:

- Multiple numerical computation bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)

- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data (a sketch of what such an eval might look like follows below)
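For what it's worth, the missing methodology wouldn't need to be elaborate. A minimal sketch of a periodic eval against a pinned snapshot (the test cases and model name here are placeholders, not OP's actual setup):

```python
import json
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()

# Hypothetical fixed cases; a real eval would want far more, plus
# graded (not just substring) scoring.
CASES = [
    {"prompt": "What is 17 * 24?", "expect": "408"},
    {"prompt": "Name the capital of Australia.", "expect": "Canberra"},
]

def run_eval(model: str) -> dict:
    passed = 0
    for case in CASES:
        resp = client.chat.completions.create(
            model=model,
            temperature=0,  # reduces sampling noise; can't remove infra noise
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        if case["expect"] in (resp.choices[0].message.content or ""):
            passed += 1
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "pass_rate": passed / len(CASES),
    }

# Run on a schedule and append to a log; a drifting pass_rate on a
# pinned snapshot is exactly the evidence this thread is asking for.
print(json.dumps(run_eval("gpt-4o-2024-08-06")))
```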

xg15 13 hours ago | parent [-]

Good points. And I also agree we'd have to see the data that OP collected.

If it did indeed show a slow decline over time and OpenAI did not change the weights, then something doesn't add up.

esafak 12 hours ago | parent | prev [-]

That's a great point. However, for practical purposes I think we can treat the serving pipeline as part and parcel of the model. So it is dishonest of companies to say they haven't changed the model while making cost optimizations that impair its effective intelligence.

colordrops 14 hours ago | parent | prev [-]

In addition to quantization, I suspect the continual additions they make to their hidden system prompt, for legal, business, and other reasons, also degrade responses slowly over time.

jonplackett 14 hours ago | parent [-]

This is quite similar to all the modifications Intel had to make because of Spectre. I bet those system prompts have grown exponentially.