chipgap98 10 hours ago

Interesting that tasks on extra high cost almost the same as on Opus 4.8, with slightly worse performance.

bredren 10 hours ago | parent | next [-]

This is on the browsercomp graph, right?

In that graph, it seems Sonnet 5 on high costs more than Opus 4.8 at a lower pass rate. Am I reading this correctly?

Edit: It looks like the key value proposition of the updated model is that it is much better than Sonnet 4.6.

Whereas Sonnet 5 delivers great value (by browsercomp benchmarks and compared to Opus) when running on low and medium.

So: Sonnet 4.6 should ~never have been run for low, medium or high when Opus 4.8 has been available. Whoops, I think I have some skills that delegate easy stuff to Sonnet.

---

I remember Anthropic pivoting everyone's default model to Opus but had not seen it put so starkly before.

I am a bit confused by the subscription `/usage` screen. It splits out Sonnet usage, and I'd presumed that meant Sonnet usage drew down less of the subscription quota.

But if this is correct, Sonnet usage was basically like smoking unfiltered cigarettes.

mchusma 10 hours ago | parent [-]

I agree with this assessment. My takeaway from this is "Generally run Sonnet on low, otherwise use Opus". Sonnet on low is kind of like an "extra low" setting of Opus (depending on the application, for sure).

bredren 9 hours ago | parent [-]

It would be good if Anthropic provided some kind of feedback, or even a toggle, to auto-route requests when a model is being used at a thinking level where a different model would be better value.

Sort of like getting an automatic upgrade at a car rental or hotel if there is availability.
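
For what it's worth, the check behind a toggle like that could be quite small. A rough sketch, assuming you have a per-task cost and a pass rate for each model/effort combination. `Config`, `better_value`, the model names, and every number here are made-up placeholders for illustration, not values from the graph and not anything Anthropic provides:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    model: str        # placeholder model name
    effort: str       # thinking level, e.g. "low", "medium", "high"
    cost: float       # hypothetical cost per task
    pass_rate: float  # hypothetical pass rate in [0, 1]


def better_value(current: Config, candidates: Sequence[Config]) -> Optional[Config]:
    """Return the cheapest config that is no more expensive than `current`,
    passes at least as often, and is strictly better on one of the two.
    Return None if nothing dominates `current`."""
    dominating = [
        c for c in candidates
        if c.cost <= current.cost
        and c.pass_rate >= current.pass_rate
        and (c.cost < current.cost or c.pass_rate > current.pass_rate)
    ]
    if not dominating:
        return None
    # Cheapest first; break ties by the higher pass rate.
    return min(dominating, key=lambda c: (c.cost, -c.pass_rate))


# Placeholder numbers only, chosen to show one dominated configuration.
CONFIGS = [
    Config("small-model", "low", 1.0, 0.60),
    Config("small-model", "high", 5.0, 0.70),
    Config("large-model", "medium", 4.0, 0.80),
]

upgrade = better_value(CONFIGS[1], CONFIGS)
if upgrade is not None:
    print(f"route to {upgrade.model} on {upgrade.effort}")
```

With these placeholder numbers, the small model on high gets routed to the large model on medium, while the small model on low is left alone because nothing is both cheaper and better. A real router would also have to account for latency and for benchmarks not matching your actual workload.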

mcbuilder 10 hours ago | parent | prev [-]

LRMs are plateauing for sure. Not that there won't be gains to be had in the future, but it's no longer the era of rapid progress that the past year was.

gdhkgdhkvff 8 hours ago | parent | next [-]

I agree that the rapid improvement of the 2023-24 era is over (once you've gone from a 3/10 to a 7/10, you can’t then go to an 11/10). There was just so much more space to grow back then.

But isn’t Fable supposed to be another step change? I never used it, myself.

Tbh, at this point I think top tier models are smart “enough” (I’m sure this will look antiquated in a year), and the way to give me MORE noticeable improvement is to make them much faster rather than much smarter. Or even a way to automatically and accurately pick faster models when it makes sense. I know that IDEs have Auto modes, but it’s not something that I trust right now to pick smart+fast instead of picking “maybe smart enough”+“cheaper for harness owner”.

roughly 10 hours ago | parent | prev [-]

A great many people were predicting this would be the case a year ago and being told they were wrong and to get on the boat.

mcbuilder 9 hours ago | parent [-]

I consider myself to be in that cohort as well. :)