xg15 | 13 hours ago
Sorry, but this makes no sense. Numerical instability would lead to random fluctuations in output quality, not to the continuous slow decline the OP described. I've heard of similar experiences from real-life acquaintances: a prompt worked reliably for hundreds of requests per day over several months, and then, when a newer model was released, the model suddenly started making mistakes, ignoring parts of the prompt, and so on. I agree it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one; it might just be that fewer resources are allocated to the older model once the newer one is available, so the inference parameters change. But some effect around the release of a newer model does seem to be there.
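
To be concrete about what "random fluctuations" means here (a minimal Python sketch, nothing to do with any provider's actual stack): floating-point addition isn't associative, so the same numbers reduced in a different order, as happens with different batch sizes or kernel schedules, give slightly different results from run to run, with no trend in any direction:

    import random

    vals = [random.uniform(-1, 1) for _ in range(100_000)]
    total_a = sum(vals)        # one summation order
    random.shuffle(vals)       # same numbers, different order
    total_b = sum(vals)
    print(total_a - total_b)   # typically a tiny nonzero difference

That's noise around a fixed value, which is exactly why it can't explain a months-long monotonic decline.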
icyfox | 13 hours ago | parent
I'm responding to the parent comment, who suggests we version control the "model" in Docker. There are infra reasons why companies don't do that. Numerical instability is one class of inference issue, but there can be other bugs in the stack separate from intentionally changing the weights or switching to a quantized model.

As for the original forum post:

- Multiple numerical computation bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)

- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data (a sketch of what a minimal harness could look like follows below)
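
By "eval methodology" I mean something as simple as this sketch (the endpoint, payload, and response fields here are made-up placeholders, not any real API): fix a prompt set, pin the sampling parameters, snapshot outputs on a schedule, and diff the snapshots over time instead of trusting memory.

    import json
    import time
    import requests

    # Hypothetical endpoint and payload shape, just to show the idea.
    ENDPOINT = "https://api.example.com/v1/complete"
    PROMPTS = ["Summarize: ...", "Extract the date from: ..."]

    def run_eval():
        outputs = []
        for p in PROMPTS:
            r = requests.post(ENDPOINT, json={
                "prompt": p,
                "temperature": 0,   # pin sampling params
                "seed": 1234,       # if the API supports one
            })
            outputs.append(r.json()["text"])  # response field is assumed
        return outputs

    # Write a dated snapshot; later snapshots can be diffed against it.
    snap = {"ts": int(time.time()), "outputs": run_eval()}
    with open(f"eval-{snap['ts']}.json", "w") as f:
        json.dump(snap, f, indent=2)

Without something like that, "the model got worse" and "my prompts drifted toward harder cases" are indistinguishable.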