chongli 4 hours ago

New models literally do get worse after launch, due to optimization. If you charted performance over time, it'd look like a sawtooth, with a regular performance drop during each optimization period.

That's the dirty secret with all of this stuff: "state of the art" models are unprofitable due to high cost of inference before optimization. After optimization they still perform okay, but way below SOTA. It's like a knife that's been sharpened until razor sharp, then dulled shortly after.

girvo 3 hours ago | parent | next [-]

> If you charted performance over time, it'd look like a sawtooth

People have, though, and it doesn't show that. I think it's more that people get hit by the placebo effect and the novelty effect, and then the models' inherent non-determinism leads them to say things like "the model got worse".

gobdovan 3 hours ago | parent | prev [-]

Is this insider info? The 'charted performance' caught my eye instantly. Couple things I find odd tho: why sawtooth? It would more likely be square waves, as I'd imagine they roll out the cost-saving version quite fast per cohort. Also, aren't they unprofitable either way? Why would they do it for 'profitability'?

chongli 3 hours ago | parent | next [-]

It's not insider info, it's common knowledge in the industry (Google "model optimization"). I think they are unprofitable either way, but unoptimized models burn runway a lot faster than optimized ones.

The reason it's not a square wave is that new optimization techniques are always in development, so you can't apply everything immediately after training the new model. I also think there's a marketing reason: if the performance of a brand new model declines rapidly after release, people are going to notice much more readily than with a gradual decline. The gradual decline is thus engineered by applying different optimizations over time.

It also has the side benefit that the future next-gen model may be compared favourably with the current-gen optimized (degraded) model, setting up a rigged benchmark. If no one has access to the original pre-optimized current-gen model, no one can perform the "proper" comparison to be able to gauge the actual performance improvement.

Lastly, I would point out that vendors like OpenAI are already known to substitute previous-gen models if they determine your prompt is "simple." You should also count this as a (rather crude) optimization technique because it's going to degrade performance any time your prompt is falsely flagged as simple (false positive).

bonoboTP 3 hours ago | parent | prev [-]

It's rumors based on vibes. There are attempts to track and quantify this by running model evaluations multiple times per day, but no sawtooth pattern has emerged as far as I know.
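
For anyone who wants to check this themselves, here is a rough sketch of the kind of comparison such tracking could make: average the first few evaluation scores after launch and compare them with the most recent few. The function name, the window size, and the score lists are all made up for illustration; this is not taken from any real tracker or real measurements.

```python
from statistics import mean

def post_launch_drop(scores, window=3):
    """Mean of the first `window` scores minus mean of the last `window`.

    A clearly positive value would be consistent with degradation after
    launch; a value near zero would not. This is a crude check and says
    nothing about noise or statistical significance.
    """
    if len(scores) < 2 * window:
        raise ValueError("need at least 2 * window scores")
    return mean(scores[:window]) - mean(scores[-window:])

# Hypothetical daily benchmark accuracies, invented for illustration only.
flat = [0.81, 0.80, 0.82, 0.81, 0.80, 0.82]
declining = [0.82, 0.81, 0.80, 0.76, 0.75, 0.74]

print(round(post_launch_drop(flat), 3))
print(round(post_launch_drop(declining), 3))
```

With non-deterministic models you would want many samples per day and some estimate of the run-to-run variance before reading anything into a difference like this.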