doctoboggan 10 hours ago

The cost per task chart is telling me that I should _never_ use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels.

jimbo808 3 hours ago | parent | next [-]

They're actively trying to use lobbying power to make open weight models illegal. So I'm just not going to use their services at all anymore. I don't think they're a net gain if you're a skilled senior, and the hidden cost in terms of technical debt and skill atrophy is just being swept under the rug. I'll be okay without their bullshit generator.

anonzzzies an hour ago | parent | next [-]

Sure about Dario (and all billionaire) weirdness, but "no gains if you are a skilled senior" is, well, very far from our experience (our company is 30 years old with mostly the original employees and founders): what we deliver now, at the speed and quality we deliver it, would have been impossible 10 years ago with our team size of skilled seniors. We replaced all the commercial products our clients and ourselves used with our own, giving us millions more in revenue and profit from the upselling and efficiency benefits. We work for regulated clients: our code is reviewed, pentested and audited regularly by us and 3rd parties, so it's not slop either. You are definitely leaving money on the table. We mostly use Chinese models on our own hardware (we colocate cages of racks), so this is not about Anthropic but about AI in general.

Skill atrophy is a real thing though; we try to prevent this by having hackathons (for lack of a better word) without AI, where I pick something extremely non-trivial and we implement it for fun and profit (AI would not matter much anyway, as the models are currently bad at these things). The last one was flex Paxos for our in-house DB, with obvious metrics for the end result: data integrity (duh) under failure, and performance better than or at least equal to our Raft production version.

andyroid 18 minutes ago | parent [-]

> We replaced all the commercial products our clients and ourselves used with our own

You’ll never guess what product your clients are looking to replace with their own next.

anonzzzies 16 minutes ago | parent | next [-]

Sure, that is why you need to be early. I fully believe my company won't make it another 30 years (or 10), so we prepare for that. Also, I will be dead by then, but that is unrelated.

For now everyone is still sufficiently crap at using AI to need help. We had enough clients trying to build something themselves and then come crying to us.

hughw 15 minutes ago | parent | prev [-]

Sure but in the intervening 2 years there's money to be made.

mastazi an hour ago | parent | prev | next [-]

good luck actually enforcing that.

andsoitis 2 hours ago | parent | prev [-]

> They're actively trying to use lobbying power to make open weight models illegal.

What is your evidence?

Robdel12 2 hours ago | parent | next [-]

Dario’s own mouth https://x.com/coinbureau/status/2071330294452666695/mediavie...

anon373839 an hour ago | parent | next [-]

He has also been telling bald-faced lies about open source/open weights models that are easily disproved. For example, he claimed that they lack the collaborative benefits of open source because "we can't see inside the model".

Open weights models are responsible for enabling reams of research on interpretability methods that do just that. And they have facilitated so much collaboration on architecture, inference optimizations, training and steering methods, and other topics that were completely out of reach with closed models like Anthropic's. It's really staggering to me.

ta93754829 an hour ago | parent | prev | next [-]

that link doesn't exist anymore? what did it say?

xeromal 17 minutes ago | parent | next [-]

Works for me

argee 28 minutes ago | parent | prev [-]

Strange, it opens just fine...as long as you aren't logged in (to X).

jquery an hour ago | parent | prev [-]

Yeesh. “What shall we do sire, when the peasants learn to read?” vibes

entropicdrifter 16 minutes ago | parent [-]

You mean to tell me that anyone can own a nail-gun? We can't have people buying their own nail-guns, next thing you know they might build things that aren't up to code!

jimbo808 2 hours ago | parent | prev [-]

https://www.judiciary.senate.gov/imo/media/doc/2023-07-26_-_... https://xcancel.com/coinbureau/status/2071330294452666695 https://www.techpolicy.press/transcript-senate-hearing-on-pr...

> "Once the weights of a model are public, they cannot be retrieved. If a model possesses dangerous capabilities, it is permanently out in the wild... We need to consider regulatory frameworks that account for the unique risks of open-source distribution of highly capable frontier models."

TurdF3rguson an hour ago | parent [-]

That's true I guess. If someone decides a model needs more guard rails, anthropic can adjust it, whereas with open weights it's too late.

It definitely sounds like the kind of thing that ends the world in B sci-fi thrillers.

AquinasCoder 9 hours ago | parent | prev | next [-]

While I appreciate that they publish this information, it's increasingly hard to keep track of it all. I've lost the mental model of how different models at different effort levels perform and what tasks they are good at.

In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.

m-dot-reviews 3 hours ago | parent | next [-]

I've been plugging this perhaps too many times now, but I am trying to bootstrap a user-sourced corpus of exactly "what model is good at task X". So, not benchmarks, but high-level tasks. There's a bit of an ordering problem in that nobody wants to bother commenting on a site that has few comments - so PTAL and contribute if you can. https://model.reviews

matheusmoreira 5 hours ago | parent | prev | next [-]

I always use Opus 4.8 at max effort for everything. The $20 subscription didn't have enough tokens, but the $100 one had too many of them. So now I just max out Opus in order to maintain 100% weekly utilization.

easygenes 4 hours ago | parent | next [-]

I'm a heavy enough user that I have both the OAI and Anth $200 plans. I always use at least 50% of my weekly Opus quota at Extra setting (meaning I use double the limit of the $100 plan, at minimum). Max I rarely touch because it is twice as slow and the incremental capability gain is minimal. Usually if Opus can't sort something well at Extra, the answer isn't to use Max but to hand the issue off to GPT-5.5 at XHigh.

tyg13 4 hours ago | parent [-]

I too have settled into a kind of dual Claude/GPT model setup. I will often use one to review the other's work, or critique the other's plan in some way. Sometimes I'll have Claude implement a feature one way, then have GPT do it the other way, then have them both review each other's implementation. Then synthesize a final plan from the previous implementations+reviews.

I might just be having fun with models, but I have actually noticed their capabilities vary somewhat, and so my (perhaps vain) hope is that by using both, each can catch the other's blind spots. It's still unclear to me if that's consistently happening, but I am making substantial progress in my personal and professional projects, so something seems to be working.

dirtbag__dad 3 hours ago | parent [-]

Yes, same, between the two of them I feel like results are just better because they have different priorities.

At the same time, I’ve invested in tooling that prints and lints the architecture I want, so which model I use is less of an interesting decision, because the results tend to be very close.

ATMLOTTOBEER 5 hours ago | parent | prev [-]

Agreed I think your strategy is optimal. This is what I landed on as well

vcf 5 hours ago | parent [-]

Me too, I rarely hit limits anymore on the $100 Max, except for the brief period with Fable

nolok 6 hours ago | parent | prev | next [-]

Same boat as you, and my answer is "... except when I ask for an overall or checkup task that is especially heavy or supervisory, in which case I use the maximum level", which lately has meant ultracode.

I'm not going to play around with thinking level every request because the goal is to make me save time not spend it in a different setting menu.

brobdingnagians 6 hours ago | parent | prev | next [-]

I tend to run it on High and then step it up for problems where I'm noticing it struggles, bump it back down after. Sometimes I accidentally leave a session in Ultracode for a day and wonder why things are taking so long, but generally happy with the results.

sanderjd 9 hours ago | parent | prev | next [-]

What I want is a harness that knows how to optimize this kind of thing for me.

nl 4 hours ago | parent | next [-]

In practice I don't think any harness (happy to be corrected here!) uses the lesser capability models for writing code. The cost trade-offs are rarely worth it.

They are often used for reading code though.

To expand on this: while the "big model to write a plan, small model to write the specific code" idea is quite common, it trips up on edge cases.

In theory the flow works like this:

- small fast models read lots of code, and pass details to the large model to write a plan

- large model takes those details and writes a detailed plan

- medium models write the code

The issue happens when the medium model hits something that the plan didn't take into account (which happens a lot - the big model didn't actually read the code). Then it has to either guess, or pass back to the large model.

If it guesses, the plan usually starts to fall to bits.

If it passes back to the large model, inevitably the large model has to start reading lots of code. In that case you are paying the expensive tokens to read, so you might as well have it write the code too (far fewer tokens are written than are read).
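A rough sketch of that last point, with entirely made-up prices and token counts (none of these numbers come from any provider's price list). It only illustrates why, once the large model has done the reading, routing the writing to a cheaper model saves little:

```python
def task_cost(read_tokens, write_tokens, read_price, write_price):
    """Cost in USD for one task; prices are USD per million tokens."""
    return (read_tokens * read_price + write_tokens * write_price) / 1_000_000

# Hypothetical numbers, chosen only for illustration.
LARGE_READ_PRICE = 5.0     # assumed input price of the large model
LARGE_WRITE_PRICE = 25.0   # assumed output price of the large model
MEDIUM_WRITE_PRICE = 15.0  # assumed output price of the medium model

READ_TOKENS = 2_000_000    # code the large model ends up reading
WRITE_TOKENS = 50_000      # code that actually gets written

# Hybrid: large model reads, medium model writes.
hybrid = task_cost(READ_TOKENS, WRITE_TOKENS, LARGE_READ_PRICE, MEDIUM_WRITE_PRICE)

# Single model: large model reads and writes.
all_large = task_cost(READ_TOKENS, WRITE_TOKENS, LARGE_READ_PRICE, LARGE_WRITE_PRICE)

saving_fraction = (all_large - hybrid) / all_large
print(f"hybrid=${hybrid:.2f} all_large=${all_large:.2f} saving={saving_fraction:.1%}")
```

Under these assumptions the read cost dominates both totals, so the hybrid setup saves only a few percent. The sketch also ignores whatever the medium model would have to read for itself, which would shrink the saving further.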

It might be possible to get this to work, but I haven't seen anyone who has tried agentic work with frontier models be satisfied with this hybrid setup.

I'd note that Amp (mentioned above) is probably the leader in using multiple providers in a coding agent but still uses frontier models to write code.

sanderjd an hour ago | parent [-]

Great info, thanks!

cunningfatalist 8 hours ago | parent | prev | next [-]

You might want to check out Amp: https://ampcode.com/

sanderjd 7 hours ago | parent [-]

I appreciate the suggestion! But it isn't clear to me, from reading their marketing site, what they bring to the table from this perspective. Can you give me a more targeted pitch?

manojlds 9 hours ago | parent | prev [-]

Which is your own harness and your own evals for your tasks I guess

munk-a 6 hours ago | parent | next [-]

I don't demand a customized compiler for my code even if such a compiler could outperform gcc. There is a lot of value in focusing on correctness to an extreme degree even if the outcome might be suboptimal to something more tailored - a tool with a large customer base can justify more resources going into its maintenance.

sanderjd 8 hours ago | parent | prev [-]

Maybe. But that sounds like a large amount of bespoke work for what seems like a common problem?

manojlds 7 hours ago | parent [-]

I was talking about enterprise agents and then realized the question is more about coding agents.

sanderjd 7 hours ago | parent [-]

Ah I see! Yes, I was talking about a coding harness, not an enterprise agent. I entirely agree with you that your suggestion of driving it via evals is the right thing for that use case!

jimbo808 3 hours ago | parent | prev | next [-]

It's really not that much. It's a bit hard to make sense of it not because it's hard to keep track of, but because they are being deceptive and opaque about what you're actually buying, and the thing you're paying for is different from one day to the next, as they fuck around with the parameters to boost subjective performance during a launch, then quietly degrade the service to cut costs.

jbvlkt 7 hours ago | parent | prev | next [-]

This is exactly my problem with all AI tools. I want someone else to create working tools for me so I can focus on my product. It is the same with other tools. I do not want to spend huge amounts of energy and time to set up my IDE, operating system or desk layout. I guess it is too early to have that now.

jerojero 6 hours ago | parent [-]

I think that's the whole selling point of lovable?

tash_2s 2 hours ago | parent | prev | next [-]

I also ended up using max effort/reasoning for both coding and general chat. They don't spend too much extra time on simple tasks these days.

throwaway219450 4 hours ago | parent | prev | next [-]

Same advice as ever? We call it context engineering now, but prompt engineering still matters a lot. Most of the failures I run into are unspecified assumptions made by the model that derail the conversation, but usually updating the first prompt fixes it. Opus in my experience is a bit better about checking assumptions, while Sonnet will plow on ahead. An example is mentioning a file that doesn't exist: Sonnet will go ahead and try to grep your entire hard drive for it. Opus will say it's not local and request the path.

I trust neither for general knowledge and I still find Opus giving me answers that are completely BS. But the token spend for Q&A is nothing compared to coding, so I always use Opus + a lot of thinking. For coding, I find Opus to be better value/token but I haven't done any sort of rigorous test.

deadbabe 4 hours ago | parent | prev | next [-]

There are token optimization consultants that can help organizations find the right balance of models for their employees to minimize costs.

j45 6 hours ago | parent | prev | next [-]

Just because it’s hard to keep track of doesn’t mean it’s not relevant.

Scheduling an hour or two on one's calendar each week to play around with the differences is incredibly helpful, as is saving links throughout the week to try out.

paulddraper 6 hours ago | parent | prev | next [-]

It's almost like you want an automatically intelligent choice of your artificial intelligence.

Understandable frankly.

jacooper 9 hours ago | parent | prev [-]

Just use deepswe as a reference point.

2001zhaozhao 9 hours ago | parent | prev | next [-]

There are two wrinkles to this:

- For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.

- For some tasks the sheer amount of raw input tokens is what matters most. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them.

timcobb 9 hours ago | parent [-]

> This is why there was a "Sonnet only" usage bar for Max tier for the longest time.

it's still there. I still don't totally grok why I can't use all my tokens on Sonnet if I want to... maybe that signals something?

i000 7 hours ago | parent | next [-]

They want to encourage diversifying model use.

radlad 7 hours ago | parent | next [-]

Seems kinda weird - it's cognitive load I'd love to avoid. If I'm going to take it on, I might as well try other providers.

aqfamnzc 7 hours ago | parent | prev [-]

Why?

munk-a 6 hours ago | parent [-]

It helps solicit more feedback and lets them trial different approaches. You're not just a user, you're a tester!

laughingcurve 7 hours ago | parent | prev [-]

Distillation attacks? Volume of calls?

energy123 9 hours ago | parent | prev | next [-]

The arguable caveat is Sonnet may run faster (although this isn't known for sure, due to more tokens being used for the same task), so you can potentially get more done in a synchronous iterative workflow

I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.

kolinko 6 hours ago | parent [-]

From my benchmarks, sadly, it doesn't seem to be the case much. Surprisingly. I found Sonnet comparable in speed to Opus (sic), but perhaps I was testing it wrong?

riverbirch 5 hours ago | parent [-]

I can confirm this; I'm not seeing much of a difference in practice either.

Torkel 9 hours ago | parent | prev | next [-]

Yeah, I was looking at the same chart and was very surprised at where the curve is relative to opus... Feels like sonnet 5 is "what if opus had an extra-low effort level"?

XCSme 6 hours ago | parent | prev | next [-]

Well, it is a Sonnet model, and it is indeed better[0] than Sonnet 4.6 (smarter, faster, cheaper), but I don't see why you would use it as opposed to Opus 4.8 low or GLM-5.2...

[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

XCSme 6 hours ago | parent [-]

What's interesting is that Sonnet 5 is actually worse[0] than 4.6 without reasoning.

It makes some sense, as models are increasingly trained with reasoning rather than without.

[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-non...

lucamark 8 hours ago | parent | prev | next [-]

You're referring to the Agentic search chart, but if you look at the Agentic computer use chart, the cost is basically halved.

However, I am also confused about the market positioning. Too expensive to perform daily tasks - open source models are much cheaper - and not a frontier model to address complex real world problems.

Rarely used Sonnet btw.

energy123 8 hours ago | parent | next [-]

You're the second person that has said this but I cannot understand why you are interpreting the "Agentic computer use" graph in this manner.

The graph shows that Opus is cheaper than Sonnet for the same performance. Unless I am suffering a cognitive blindness thing right now.

lucamark 8 hours ago | parent [-]

Wrong! Look at it more closely. It shows that Opus has superior performance but at higher cost.

doctoboggan 8 hours ago | parent | next [-]

No, you are misunderstanding the graph. Draw a vertical line anywhere, that is a "constant cost" line. For any given cost, Opus 4.8 has a higher performance than Sonnet 5. Only where Sonnet 5 effort is at medium or low would it make any sense to use it, as there isn't even an equivalent Opus effort level to compare to.

Alternatively you can draw a horizontal "constant performance" line and see that Opus is cheaper for a given performance level.
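A minimal sketch of the "vertical line" comparison, assuming piecewise-linear interpolation between effort levels. The (cost, score) points below are invented for illustration and are not read off the chart; `interp` is a hypothetical helper name.

```python
from bisect import bisect_right

def interp(points, cost):
    """Piecewise-linear score at a given cost.

    `points` is a list of (cost, score) pairs sorted by cost.
    Returns None outside the measured cost range (no extrapolation).
    """
    costs = [c for c, _ in points]
    if cost < costs[0] or cost > costs[-1]:
        return None
    i = bisect_right(costs, cost)
    if i == len(costs):          # exactly at the most expensive point
        return points[-1][1]
    (c0, s0), (c1, s1) = points[i - 1], points[i]
    return s0 + (s1 - s0) * (cost - c0) / (c1 - c0)

# Hypothetical data: (cost per task in USD, score in %).
sonnet = [(0.10, 60.0), (0.20, 70.0), (0.35, 76.0), (0.52, 80.0)]
opus = [(0.30, 76.0), (0.45, 80.0), (0.70, 84.0)]

# Draw a vertical line at $0.40: both curves are defined there.
print(interp(sonnet, 0.40), interp(opus, 0.40))

# Draw one at $0.20: only the cheaper model has a measured point.
print(interp(sonnet, 0.20), interp(opus, 0.20))
```

With these made-up points the higher curve wins wherever both are defined, and below the cheapest point of one model there is simply nothing to compare against.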

827a 8 hours ago | parent | prev | next [-]

Why are you comparing xhigh reasoning between Sonnet and Opus? Of course Sonnet xhigh is cheaper than Opus xhigh, but that isn't the point; the point is that e.g. 80% accuracy on Opus costs ~$0.45 (medium reasoning) whereas on Sonnet it costs ~$0.52 (xhigh/max reasoning).

brokencode 8 hours ago | parent | prev | next [-]

That is a bad comparison. Compare Sonnet xhigh against Opus medium, which is both better and cheaper.

energy123 8 hours ago | parent | prev [-]

No, that's apples and oranges. You need to compare Sonnet 5's 79% with the interpolated Opus 4.8's 79%.

annzabelle 6 hours ago | parent | prev | next [-]

> Too expensive to perform daily tasks - open source models are much cheaper

There is a real advantage, especially for businesses, in using an off the shelf solution from a corporate provider.

Personally, the advantage of not having to set up multiple solutions from multiple sources outweighs the cost of a $20 a month subscription. Think about why a lot of consumers prefer Apple devices over Linux. There are a lot of advantages to Linux, but "never having to think about my tools" is its own advantage.

girvo 6 hours ago | parent | prev [-]

The specific market positioning is... for me to use at my big tech company job, where we aren't allowed to use GLM and similar, but have fixed caps on how much token usage we're allowed to rack up a month.

johnfn 9 hours ago | parent | prev | next [-]

That's just one benchmark, though. Tab to the next one and Sonnet 5 performs better as effort goes up just as you'd expect. I imagine the suggestion is that performance vs effort tradeoff is task dependent.

energy123 9 hours ago | parent [-]

No it doesn't? It's worse than Opus across the whole shared frontier on both plots.

acchow 7 hours ago | parent [-]

Agreed. The graphs clearly show that opus 4.8 performs strictly better at the same cost per task

jsnell 7 hours ago | parent [-]

But they don't show "strictly better" performance at cost per task!

The graphs show parts of the cost/performance Pareto frontier occupied by Opus 4.8 and others occupied by Sonnet 5.0. If Opus 4.8 was strictly better at cost per task like you say, by definition the entire frontier would be occupied by Opus.

So neither is Pareto-dominant over the other. In contrast, Sonnet 5.0 is Pareto-dominant over Sonnet 4.6 on those graphs.
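To make the frontier argument concrete, here is a small sketch that computes the Pareto frontier over discrete (cost, score) points. The labels and numbers are hypothetical stand-ins, not values from the blog post's graphs.

```python
def pareto_frontier(points):
    """Return the (name, cost, score) points that no other point dominates.

    A point dominates another if it costs no more AND scores no less,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for name, cost, score in points:
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append((name, cost, score))
    return sorted(frontier, key=lambda p: p[1])

# Hypothetical effort levels: (label, cost per task in USD, score in %).
points = [
    ("sonnet-low", 0.10, 60.0),
    ("sonnet-medium", 0.20, 70.0),
    ("sonnet-xhigh", 0.52, 80.0),
    ("opus-low", 0.30, 76.0),
    ("opus-medium", 0.45, 80.0),
    ("opus-high", 0.70, 84.0),
]

frontier_names = [name for name, _, _ in pareto_frontier(points)]
print(frontier_names)
```

In this toy data the cheapest points stay on the frontier because nothing cheaper exists to dominate them, while the most expensive point of the cheaper model is dominated by a point that matches its score at lower cost.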

energy123 6 hours ago | parent [-]

> by definition the entire frontier would be occupied by Opus.

But the entire frontier is occupied by Opus under any reasonable interpolation scheme (piecewise linear which is what they've done, and most reasonable spline or polynomial fits would also lead to the same result) over the overlapping x values for which both are defined.

Under that interpolation scheme, for x > ($ cost of Opus low effort), Opus is Pareto-dominant over Sonnet 5. You can see this by picking any point on Opus's interpolation and realizing that you get strictly worse by switching to Sonnet for the same x value or the same y value. Meaning if you want to pay the same $x then you get a worse y, or if you want the same y you pay more $x.

jsnell 6 hours ago | parent [-]

I really don't get what you're proposing. The cost ranges do not overlap at the low end. You can't (by definition!) interpolate outside of the range.

If you mean extrapolate, at that point you're just making up data. The available effort levels are discrete and covered totally by the benchmarks. You can draw on the monitor with a sharpie to show a "ultra-low" effort level for Opus that scores better than Sonnet "low" at the same price, but it doesn't magic the ultra-low effort into actual existence.

(Anyway, the blog post now has an erratum and a graph that shows substantially better relative performance for Sonnet 5.0 than the original graph.)

energy123 6 hours ago | parent [-]

That's why I said "over the shared frontier" in my first post and more precisely in my second post I said "over the overlapping x values for which both are defined."

It was a claim that applies to a range of x-values where both curves are defined.

Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region. Which is what I understand to be your point?

jsnell 6 hours ago | parent [-]

The post I was replying to said "performs strictly better at the same cost per task". That claim was obviously not true: there are costs where Opus cannot do the task and Sonnet can, so Opus can't be performing strictly better at the same cost. It seems that you agree that it is not true.

You could make it true by artificially dropping some of the data points, but, like, why?

(Again, this is moot given the updated graph.)

> Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region.

Not so! It's only sound to do that at the low end of the cost axis (x) or the high end of the performance axis (y). You can't do it at the low end of the performance axis or the high end of the cost axis.

seiru 9 hours ago | parent | prev | next [-]

Worth noting that the default chart there is for "agentic search performance", not coding. I didn't see an effort comparison for coding specifically.

booi 8 hours ago | parent | prev | next [-]

I actually exclusively use Sonnet at the low effort level. It's too slow otherwise, and at higher effort levels it is strictly worse than Opus.

partsch 6 hours ago | parent | prev | next [-]

I feel like the charts have been adjusted. I am quite sure they looked different a couple of hours ago...

callahad 5 hours ago | parent [-]

They've absolutely both changed. The initial version I saw didn't include max effort data points on the first chart, and the plot itself was much less favorable to Sonnet at high/xhigh relative to Opus, but the new chart shows them as closer competitors. Weird.

intellijdd 9 hours ago | parent | prev | next [-]

I noticed that as well but with the introductory pricing, I wonder how true that is.

It would be great to see these charts with the promotional pricing just because it’s here for about two whole months.

I guess I could get Sonnet 5 to do it.

manojlds 9 hours ago | parent | prev | next [-]

Opus 4.8 high doing better and cheaper than Sonnet 5 xhigh

al_borland 9 hours ago | parent | prev | next [-]

What is a "task" in real-world terms? If it will be $15/million output tokens, and high/xhigh is somewhere in the $7.50/task range, does that mean a single task is using 500k tokens? That seems like it would start to add up fast.
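The arithmetic behind that 500k figure, as a quick sketch. It assumes the whole per-task cost is output tokens at the quoted rate and ignores input tokens entirely, so it is an upper bound on output tokens rather than a real breakdown; `output_tokens_for_budget` is a hypothetical helper name.

```python
def output_tokens_for_budget(cost_per_task_usd, price_per_million_usd):
    """Output tokens a per-task budget buys, if nothing else is billed."""
    return cost_per_task_usd / price_per_million_usd * 1_000_000

# Figures as quoted in the comment: $7.50 per task, $15 per million output tokens.
tokens = output_tokens_for_budget(7.50, 15.0)
print(f"{tokens:,.0f} output tokens per task")
```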

wyre 8 hours ago | parent [-]

I’ve found input tokens are around 5x more than output, so a task could be a couple million thinking tokens and then a few hundred thousand output tokens?

goldenarm 7 hours ago | parent | prev | next [-]

It's funny the exact same thing happened to Gemini 3.5 flash. Cheaper and more agentic model that ends up worse and more expensive than 3.5 pro low.

Readerium 5 hours ago | parent [-]

3.5 Pro hasn't launched yet; you mean 3.1 Pro?

goldenarm 5 hours ago | parent [-]

Yes sorry for the typo

Natelinathan 8 hours ago | parent | prev | next [-]

I just re-wrote the /code-review skill Anthropic ships to use Sonnet 4.6 for some tasks, as it was using Opus for simple git diff commands and similarly mechanical tasks (launched 100+ agents for one of my diffs, cmon). I wonder how Sonnet 5 will impact my usage.

Does anyone else have any review token saving measures?

nicce 8 hours ago | parent | prev | next [-]

> Opus always performs better for a given cost.

Assume it to get deprecated sooner rather than later.

ZeWaka 9 hours ago | parent | prev | next [-]

It's very interesting. Why even release a new product that underperforms at the same price level? Why not just lock it?

I guess it's probably a lot cheaper for them to run, and it cuts costs for them. Seems disingenuous, though.

make3 5 hours ago | parent | prev | next [-]

it might be worth it if speed is an issue

windexh8er 4 hours ago | parent | prev [-]

Except for the fact that Opus 4.8 is not good. Constant hallucinations, doesn't use the web very intentionally until you explicitly ask it to and it nopes out rather quick on benign items. Anthropic has been very disappointing as of late. All of the gatekeeping is taking a toll on what should be some of the better models out there, but you can't trust 4.8 to go off on its own. It will burn down tokens doing what it deems correct as per its guidance. Truly painful to use.

lukan 4 hours ago | parent [-]

"but you can't trust 4.8 to go off on its own."

And what (available) model do you trust to go off on its own?