johnfn 9 hours ago

That's just one benchmark, though. Tab to the next one and Sonnet 5 performs better as effort goes up, just as you'd expect. I imagine the suggestion is that the performance vs. effort tradeoff is task-dependent.

energy123 9 hours ago | parent [-]

No it doesn't? It's worse than Opus across the whole shared frontier on both plots.

acchow 7 hours ago | parent [-]

Agreed. The graphs clearly show that Opus 4.8 performs strictly better at the same cost per task.

jsnell 7 hours ago | parent [-]

But they don't show "strictly better" performance at cost per task!

The graphs show parts of the cost/performance Pareto frontier occupied by Opus 4.8 and others occupied by Sonnet 5.0. If Opus 4.8 were strictly better at cost per task like you say, by definition the entire frontier would be occupied by Opus.

So neither is Pareto-dominant over the other. In contrast, Sonnet 5.0 is Pareto-dominant over Sonnet 4.6 on those graphs.
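
To make the definition concrete, here's a minimal sketch of a Pareto frontier over discrete (cost, score) points. The numbers are made up for illustration and are not read off the actual graphs; `Point`, `dominates` and `pareto_frontier` are names I picked, not anything from the blog post.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    cost: float   # dollars per task, lower is better
    score: float  # benchmark score, higher is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a.cost <= b.cost and a.score >= b.score
            and (a.cost < b.cost or a.score > b.score))


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep the points that no other point dominates, sorted by cost."""
    return sorted(
        (p for p in points if not any(dominates(q, p) for q in points)),
        key=lambda p: p.cost,
    )


# Hypothetical data: one point per effort level, NOT the real benchmark results.
points = [
    Point("sonnet", 0.10, 40.0),
    Point("sonnet", 0.30, 55.0),
    Point("sonnet", 0.60, 60.0),
    Point("opus", 0.25, 58.0),
    Point("opus", 0.50, 66.0),
    Point("opus", 0.90, 70.0),
]

frontier = pareto_frontier(points)
frontier_models = {p.model for p in frontier}
print(frontier_models)
```

With these invented numbers the cheapest Sonnet point survives on the frontier because nothing is both cheaper and better, so both models appear on it and neither dominates the other.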

energy123 6 hours ago | parent [-]

> by definition the entire frontier would be occupied by Opus.

But the entire frontier is occupied by Opus under any reasonable interpolation scheme (piecewise linear, which is what they've done; most reasonable spline or polynomial fits would lead to the same result) over the overlapping x values for which both are defined.

Under that interpolation scheme, for x > ($ cost of Opus low effort), Opus is Pareto-dominant over Sonnet 5. You can see this by picking any point on Opus's interpolation and realizing that you get strictly worse by switching to Sonnet for the same x value or the same y value. Meaning if you want to pay the same $x then you get a worse y, or if you want the same y you pay more $x.
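
Roughly what I mean, as a sketch: interpolate each curve piecewise-linearly and compare them only on the x-range where both are defined. The data points are invented for illustration (not taken from the graphs), and `interp` is a hand-rolled helper that deliberately refuses to extrapolate.

```python
from bisect import bisect_right


def interp(x, xs, ys):
    """Piecewise-linear interpolation over strictly increasing xs.

    Raises ValueError instead of extrapolating outside [xs[0], xs[-1]].
    """
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the measured range")
    i = min(bisect_right(xs, x), len(xs) - 1)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical (cost, score) curves, NOT the real benchmark data.
sonnet_x, sonnet_y = [0.10, 0.30, 0.60], [40.0, 55.0, 60.0]
opus_x, opus_y = [0.25, 0.50, 0.90], [58.0, 66.0, 70.0]

# Overlapping cost range where both curves are defined.
lo = max(sonnet_x[0], opus_x[0])
hi = min(sonnet_x[-1], opus_x[-1])

# Sample the overlap; clamp to hi to guard against float rounding.
n = 100
grid = [min(hi, lo + (hi - lo) * k / n) for k in range(n + 1)]

opus_wins_on_overlap = all(
    interp(x, opus_x, opus_y) > interp(x, sonnet_x, sonnet_y) for x in grid
)
print(lo, hi, opus_wins_on_overlap)
```

Note that the claim is restricted to `[lo, hi]`: asking for Opus below its cheapest measured cost raises an error rather than inventing a value.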

jsnell 6 hours ago | parent [-]

I really don't get what you're proposing. The cost ranges do not overlap at the low end. You can't (by definition!) interpolate outside of the range.

If you mean extrapolate, at that point you're just making up data. The available effort levels are discrete and covered totally by the benchmarks. You can draw on the monitor with a sharpie to show an "ultra-low" effort level for Opus that scores better than Sonnet "low" at the same price, but it doesn't magic the ultra-low effort into actual existence.

(Anyway, the blog post now has an erratum and a graph that shows substantially better relative performance for Sonnet 5.0 than the original graph.)

energy123 6 hours ago | parent [-]

That's why I said "over the shared frontier" in my first post and more precisely in my second post I said "over the overlapping x values for which both are defined."

It was a claim that applies to a range of x-values where both curves are defined.

Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region. Which is what I understand to be your point?

jsnell 6 hours ago | parent [-]

The post I was replying to said "performs strictly better at the same cost per task". That claim was obviously not true: there are costs where Opus cannot do the task and Sonnet can, so Opus can't be performing strictly better at the same cost. It seems that you agree that it is not true.

You could make it true by artificially dropping some of the data points, but, like, why?

(Again, this is moot given the updated graph.)

> Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region.

Not so! It's only sound to do that at the low end of the cost axis (x) or the high end of the performance axis (y). You can't do it at the low end of the performance axis or the high end of the cost axis.
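
A small sketch of the asymmetry, with made-up numbers (not from the graphs): a point that is the only one defined at the high end of the cost axis can still be dominated by a cheaper, better-scoring point, while the single cheapest point never can be.

```python
def dominates(a, b):
    """a and b are (cost, score) pairs; lower cost and higher score are better."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b


# Hypothetical data where Sonnet extends past Opus at BOTH ends of the cost axis.
opus = [(0.25, 58.0), (0.50, 66.0), (0.90, 70.0)]
sonnet = [(0.10, 40.0), (0.30, 55.0), (1.20, 62.0)]
everything = opus + sonnet

cheapest = min(everything)  # tuples compare by cost first
priciest = max(everything)

cheapest_is_dominated = any(dominates(q, cheapest) for q in everything)
priciest_is_dominated = any(dominates(q, priciest) for q in everything)
print(cheapest_is_dominated, priciest_is_dominated)
```

Here the $1.20 Sonnet point sits where only Sonnet is defined, yet the $0.90 Opus point beats it on both axes, so it is not on the frontier.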