uludag a day ago

I'm curious what others' priors are when reading benchmark scores. Obviously, with immense funding at stake, companies have every incentive to game the benchmarks, and the loss of goodwill from gaming the system doesn't appear to carry much consequence.

Obviously, trying the model on your own use cases lets you gradually home in on its actual utility, but I'm wondering how others interpret reported benchmarks these days.

jjmarr a day ago | parent | next [-]

> Obviously, with immense funding at stake, companies have every incentive to game the benchmarks, and the loss of goodwill from gaming the system doesn't appear to carry much consequence.

Claude 3.7 Sonnet was consistently on top of OpenRouter in actual usage despite not gaming benchmarks.

candiddevmike a day ago | parent | prev | next [-]

People's interpretation of benchmarks will largely depend on whether they believe they'll be better or worse off if GenAI takes over SWE jobs. I think you'd need someone from outside the industry to weigh in to get a real, unbiased view.

douglasisshiny a day ago | parent [-]

Or someone who has been a developer for a decade plus trying to use these models on actual existing code bases, solving specific problems. In my experience, they waste time and money.

sandspar a day ago | parent [-]

These people are the most experienced, yes, but by the same token they also have the most incentive to disbelieve that an AI will take their job.

imiric a day ago | parent | prev | next [-]

Benchmark scores are marketing fluff, just like the rest of this article, with its alleged praise from early adopters and its highly scripted and edited videos.

AI companies are grasping at straws by selling us minor improvements to stale technology so they can pump up whatever valuation they have left.

j_timberlake a day ago | parent [-]

The fact that people like you are still posting like this after Veo 3 is wild. Nothing could possibly be forcing you to hold onto that opinion, yet you come out in droves in every AI thread to repost it.

imiric 15 hours ago | parent [-]

I concede that my last sentence was partly hyperbolic, particularly around "stale technology". But the rest of what I wrote is an accurate description of the state of the AI industry, from the perspective of an unbiased outsider, anyway.

What we've seen from Veo 3 is impressive, and the technology is indisputably advancing. But at the same time we're flooded with inflated announcements from companies that create their own benchmarks or optimize their models specifically to look good on benchmarks. Yet when faced with real-world tasks, the same models still produce garbage, need continuous hand-holding to be useful, and often simply waste my time. At least, this has been my experience with Sonnet 3.5, 3.7, Gemini, o1, o3, and all of the SOTA models I've tried so far. So there's a dissonance between marketing and reality that makes it really difficult to trust anything these companies say anymore.

Meanwhile, little thought is put into the harmful effects of these tools, and any alleged focus on "safety" is as fake as the hallucinations that plague them.

So, yes, I'm jaded by the state of the tech industry and where it's taking us, and I wish this bubble would burst already.

lewdwig a day ago | parent | prev | next [-]

Well-designed benchmarks have a public sample set and a private testing set. Models are free to train on the public set, but they can't game the benchmark or overfit the samples that way because they're only assessed on performance against examples they haven't seen.
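
In sketch form, the split works something like this (a hypothetical harness in Python with made-up names, not any specific benchmark's real code):

    # Minimal sketch of a public/private benchmark split.
    # StubModel stands in for a real API-backed model.
    class StubModel:
        def generate(self, prompt: str) -> str:
            return "42"  # a real harness would call the model here

    def score(model, cases):
        # Exact-match grading; real benchmarks use richer scoring.
        correct = sum(model.generate(c["prompt"]).strip() == c["expected"]
                      for c in cases)
        return correct / len(cases)

    # The public split is released, and labs can (and do) train on it.
    public_split = [{"prompt": "2 + 2 = ?", "expected": "4"}]

    # The private split never leaves the maintainer's machines; the
    # headline score comes only from examples the model has never seen.
    private_split = [{"prompt": "17 * 3 = ?", "expected": "51"}]
    print(f"held-out accuracy: {score(StubModel(), private_split):.0%}")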

Not all benchmarks are well-designed.

thousand_nights a day ago | parent [-]

But as soon as you evaluate on your private test set, you're sending it to their servers, so they have access to it.

So effectively you can only guarantee that a single use stays private.

minimaxir a day ago | parent [-]

Claude does not train on API I/O.

> By default, we will not use your inputs or outputs from our commercial products to train our models.

> If you explicitly report feedback or bugs to us (for example via our feedback mechanisms as noted below), or otherwise explicitly opt in to our model training, then we may use the materials provided to train our models.

https://privacy.anthropic.com/en/articles/7996868-is-my-data...

behindsight a day ago | parent [-]

Relying on their own policy doesn't mean they will adhere to it. We've already seen "rogue" employees at other companies conveniently violate their stated policies; some notable examples made the news within the past month (e.g., xAI).

Don't forget the earlier scandals in which Amazon and Apple both paid millions in settlements for eavesdropping via their voice assistants.

Privacy should not be expected from any system that phones home to an external server, regardless of whatever public policy the company proclaims.

Hence why GP said:

> So effectively you can only guarantee that a single use stays private.

iLoveOncall a day ago | parent | prev | next [-]

Hasn't it been proven many times that all those companies cheat on benchmarks?

I personally couldn't care less about them, especially when we've seen many times that the public's perception is absolutely not tied to the benchmarks (Llama 4, the recent OpenAI model that flopped, etc.).

sebzim4500 a day ago | parent [-]

I don't think there's any real evidence that the major companies are going out of their way to cheat on the benchmarks. The problem is that, unless you put a lot of effort into avoiding contamination, you will inevitably end up with details about the benchmark in the training set.
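
For what it's worth, decontamination usually amounts to scanning the training corpus for long n-gram overlaps with benchmark items. A crude sketch in Python (my own illustration, not any lab's actual pipeline):

    # Flag training docs that share an n-gram with a benchmark item.
    # Real pipelines often use long n-grams (e.g. 13 words) over
    # web-scale corpora; toy values are used here so the overlap shows.
    def ngrams(text, n):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def is_contaminated(doc, benchmark_items, n=13):
        doc_grams = ngrams(doc, n)
        return any(doc_grams & ngrams(item, n) for item in benchmark_items)

    bench = ["translate 'bonjour le monde' into english"]
    corpus = ["a blog post: translate 'bonjour le monde' into english means hello world",
              "an unrelated page about gardening and soil acidity"]
    clean = [doc for doc in corpus if not is_contaminated(doc, bench, n=5)]
    print(clean)  # only the gardening page survives

And this is easy to get subtly wrong: paraphrases, translations, and reformatted versions of benchmark items slip right past exact n-gram matching.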

blueprint a day ago | parent | prev [-]

Kind of reminds me of how they said they were increasing platform capabilities with Max but actually reduced them, while charging a ton for it per month. Talk about a bait and switch. Lord help you if you tried to cancel your ill-advised subscription during that product rollout as well - doubly so if you expected a support response.