gobdovan 4 hours ago

Instead of anti-fragility, I'd point you to the law of requisite variety. You'll notice that all AI improvements are insanely good for a week or two after launch. Then you'll see people stating that 'models got worse'. What happened in fact is that people adapted to the tool, but the tool stopped adapting. We're using AI as variety-resistant and adaptable tools, but we miss the fact that most deployments nowadays do not adapt back to you as fast.

chongli 4 hours ago | parent [-]

New models literally do get worse after launch, due to optimization. If you charted performance over time, it'd look like a sawtooth, with a regular performance drop during each optimization period.

That's the dirty secret with all of this stuff: "state of the art" models are unprofitable due to high cost of inference before optimization. After optimization they still perform okay, but way below SOTA. It's like a knife that's been sharpened until razor sharp, then dulled shortly after.

girvo 3 hours ago | parent | next [-]

> If you charted performance over time, it'd look like a sawtooth

People have, though, and it doesn't show that. I think it's more people getting hit by the placebo effect and the novelty effect, followed by the models' by-definition non-determinism leading people to say things like "the model got worse".

gobdovan 3 hours ago | parent | prev [-]

Is this insider info? The 'charted performance' caught my eye instantly. Couple things I find odd tho: why sawtooth? It would more likely be square waves, as I'd imagine they roll out the cost-saving version quite fast per cohort. Also, aren't they unprofitable either way? Why would they do it for 'profitability'?

chongli 3 hours ago | parent | next [-]

It's not insider info, it's common knowledge in the industry (Google "model optimization"). I think they are unprofitable either way, but unoptimized models burn runway a lot faster than optimized ones.

The reason it's not a square wave is because new optimization techniques are always in development, so you can't apply everything immediately after training the new model. I also think there's a marketing reason: if the performance of a brand new model declines rapidly after release then people are going to notice much more readily than with a gradual decline. The gradual decline is thus engineered by applying different optimizations gradually.

It also has the side benefit that the future next-gen model may be compared favourably with the current-gen optimized (degraded) model, setting up a rigged benchmark. If no one has access to the original pre-optimized current-gen model, no one can perform the "proper" comparison to be able to gauge the actual performance improvement.

Lastly, I would point out that vendors like OpenAI are already known to substitute previous-gen models if they determine your prompt is "simple." You should also count this as a (rather crude) optimization technique because it's going to degrade performance any time your prompt is falsely flagged as simple (false positive).

bonoboTP 3 hours ago | parent | prev [-]

It's rumors based on vibes. There are attempts to track and quantify this with repeated model evaluations multiple times per day, but no sawtooth pattern has emerged as far as I know.