gobdovan 5 hours ago

Is this insider info? The 'charted performance' caught my eye instantly. Couple things I find odd tho: why sawtooth? it would likely be square waves, as I'd imagine they roll down the cost-saving version quite fast per cohort. Also, aren't they unprofitable either way? Why would they do it for 'profitability'?

bonoboTP 5 hours ago | parent | next [-]

It's rumors based on vibes. There are attempts to track and quantify this with repeated model evaluations multiple times per day, but no sawtooth pattern has emerged as far as I know.

chongli 43 minutes ago | parent [-]

I don't want to go too far down the conspiracy rabbit hole, but the vendors know everyone's prompts so it would be trivial for them to track the trackers and spoof the results. We already know that they substitute different models as a cost-saving measure, so substituting models to fool the repeated evaluations would be trivial.

We also already know that they actively seek out viral examples of poor performance on certain prompts (e.g. counting Rs in strawberry) and then monkey-patch them out with targeted training. How can we be sure they're not trying to spoof researchers who are tracking model performance? Heck, they might as well just call it "regression testing."
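To make the "track the trackers" concern concrete, here's a minimal sketch of how a vendor *could* special-case known evaluation prompts. Everything here is invented for illustration (the prompt list, the tier names, the exact matching scheme); the point is only that exact-match routing against harvested benchmark prompts is a few lines of code.

```python
# Hypothetical sketch: route prompts that match known public eval suites
# to the strong model, and everyone else to the cheap one.
# All prompts and model-tier names below are made up for illustration.
import hashlib

KNOWN_EVAL_PROMPTS = {
    hashlib.sha256(p.encode()).hexdigest()
    for p in [
        "How many times does the letter r appear in 'strawberry'?",
        "What is 17 * 24?",
    ]
}

def route(prompt: str) -> str:
    """Return which (hypothetical) model tier would serve this prompt."""
    digest = hashlib.sha256(prompt.encode()).hexdigest()
    if digest in KNOWN_EVAL_PROMPTS:
        return "flagship"      # keep the trackers' numbers stable
    return "cost-optimized"    # everyone else gets the cheap model
```

A real spoofing scheme would need fuzzy matching (trackers can paraphrase), but exact matching already defeats any tracker that reuses a fixed prompt set verbatim.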

If their whole gig is an "emperor's new clothes" bubble situation, then we can expect them to try to uphold the masquerade as long as possible.

chongli 5 hours ago | parent | prev [-]

It's not insider info; it's common knowledge in the industry (Google "model optimization"). I think they are unprofitable either way, but unoptimized models burn runway a lot faster than optimized ones.

The reason it's not a square wave is that new optimization techniques are always in development, so you can't apply everything immediately after training the new model. I also think there's a marketing reason: if a brand-new model's performance declines rapidly after release, people are going to notice much more readily than with a gradual decline. The gradual decline is thus engineered by rolling out the different optimizations gradually.
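A toy simulation shows why a staggered rollout produces a gradual staircase rather than a square wave. The schedule and quality costs below are made-up numbers, purely to illustrate the shape of the curve:

```python
# Toy model: quality starts at 1.0 and each optimization subtracts a
# small fixed cost once its rollout day arrives. Several optimizations
# shipped weeks apart produce a gradual staircase decline, not one drop.

def quality(day: int, rollouts: list[tuple[int, float]]) -> float:
    """Model quality on `day`, given (start_day, quality_cost) rollouts."""
    q = 1.0
    for start_day, cost in rollouts:
        if day >= start_day:
            q -= cost
    return q

# Three hypothetical optimizations shipped on days 10, 30, and 60
schedule = [(10, 0.03), (30, 0.04), (60, 0.05)]

curve = [quality(d, schedule) for d in range(0, 90, 10)]
# curve steps down in small increments instead of one square-wave drop
```

Applying everything on day one would instead give a single large step, which is exactly the square wave the parent comment expected.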

It also has the side benefit that the future next-gen model may be compared favourably with the current-gen optimized (degraded) model, setting up a rigged benchmark. If no one has access to the original pre-optimized current-gen model, no one can perform the "proper" comparison to be able to gauge the actual performance improvement.

Lastly, I would point out that vendors like OpenAI are already known to substitute previous-gen models if they determine your prompt is "simple." You should also count this as a (rather crude) optimization technique because it's going to degrade performance any time your prompt is falsely flagged as simple (false positive).
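The false-positive failure mode is easy to see with even a crude router. The heuristic below is invented (vendors presumably use a learned classifier, not word counts), but any classifier has the same property: a hard prompt misjudged as "simple" silently gets the weaker model.

```python
# Deliberately crude sketch of "simple prompt" substitution: short
# prompts without code are routed to a previous-gen model. The heuristic
# and model names are hypothetical; the point is the false positive.

def looks_simple(prompt: str) -> bool:
    # Naive heuristic: short and no code fences -> "simple"
    return len(prompt.split()) < 12 and "```" not in prompt

def choose_model(prompt: str) -> str:
    return "prev-gen-cheap" if looks_simple(prompt) else "current-gen"

# A short but genuinely hard prompt is falsely flagged as simple:
hard = "Prove that sqrt(2) is irrational."
assert choose_model(hard) == "prev-gen-cheap"   # degraded answer, silently
```

From the user's side this is indistinguishable from the model "getting dumber" on certain questions.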