AquinasCoder 9 hours ago

While I appreciate that they publish this information, it's increasingly hard to keep track of it all. I've lost the mental model of how different models at different effort levels perform and what tasks they are good at.

In practice, I tend to just use the default on Claude Code, which works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.

m-dot-reviews 3 hours ago | parent | next [-]

I've been plugging this perhaps too many times now, but I am trying to bootstrap a user-sourced corpus of exactly "what model is good at task X". So, not benchmarks, but high-level tasks. There's a bit of an ordering problem in that nobody wants to bother commenting on a site that has few comments - so PTAL and contribute if you can. https://model.reviews

matheusmoreira 5 hours ago | parent | prev | next [-]

I always use Opus 4.8 at max effort for everything. The $20 subscription didn't have enough tokens, but the $100 one had too many of them. So now I just max out Opus in order to maintain 100% weekly utilization.

easygenes 4 hours ago | parent | next [-]

I'm a heavy enough user that I have both the OAI and Anth $200 plans. I always use at least 50% of my weekly Opus quota at Extra setting (meaning I use double the limit of the $100 plan, at minimum). Max I rarely touch because it is twice as slow and the incremental capability gain is minimal. Usually if Opus can't sort something well at Extra, the answer isn't to use Max but to hand the issue off to GPT-5.5 at XHigh.

tyg13 3 hours ago | parent [-]

I too have settled into a kind of dual Claude/GPT model setup. I will often use one to review the other's work, or critique the other's plan in some way. Sometimes I'll have Claude implement a feature one way, then have GPT do it the other way, then have them both review each other's implementation. Then synthesize a final plan from the previous implementations+reviews.
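As a rough sketch of that loop, assuming each model is just a function from prompt to text: the `Model` type, the prompt wording, and the `fake_*` stand-ins below are all hypothetical, and nothing here calls a real API.

```python
from typing import Callable

# Hypothetical interface: a "model" is anything that maps a prompt to text.
Model = Callable[[str], str]


def cross_review(task: str, model_a: Model, model_b: Model, synthesizer: Model) -> str:
    """Both models implement the task, each reviews the other, then one synthesizes."""
    impl_a = model_a(f"Implement: {task}")
    impl_b = model_b(f"Implement: {task}")
    # Each model critiques the implementation it did not write.
    review_of_b = model_a(f"Review this implementation:\n{impl_b}")
    review_of_a = model_b(f"Review this implementation:\n{impl_a}")
    sections = [impl_a, review_of_a, impl_b, review_of_b]
    return synthesizer("Synthesize a final plan from:\n" + "\n---\n".join(sections))


# Stand-in models so the sketch runs offline; they echo the first prompt line.
def fake_claude(prompt: str) -> str:
    return "claude<" + prompt.splitlines()[0] + ">"


def fake_gpt(prompt: str) -> str:
    return "gpt<" + prompt.splitlines()[0] + ">"


def fake_synth(prompt: str) -> str:
    # A real synthesizer would be another model call; this one returns its input.
    return prompt


final = cross_review("add retry logic", fake_claude, fake_gpt, fake_synth)
print(final)
```

Swapping the stand-ins for real client calls is the only part that depends on a provider; the ordering of implement, cross-review, synthesize stays the same.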

I might just be having fun with models, but I have actually noticed their capabilities vary somewhat, and so my (perhaps vain) hope is that by using both, each can catch the other's blind spots. It's still unclear to me if that's consistently happening, but I am making substantial progress in my personal and professional projects, so something seems to be working.

dirtbag__dad 3 hours ago | parent [-]

Yes, same, between the two of them I feel like results are just better because they have different priorities.

At the same time, I’ve invested in tooling that prints and lints the architecture I want, so which model to use is less of an interesting decision, because the results tend to be very close.

ATMLOTTOBEER 5 hours ago | parent | prev [-]

Agreed, I think your strategy is optimal. This is what I landed on as well.

vcf 5 hours ago | parent [-]

Me too, I rarely hit limits anymore on the $100 Max, except for the brief period with Fable

nolok 6 hours ago | parent | prev | next [-]

Same boat as you, and my answer is "... Except when I ask for an overall or checkup task that is specifically heavy or overseeing, in which case I use the maximum level", which lately has meant ultracode.

I'm not going to play around with thinking level every request because the goal is to make me save time not spend it in a different setting menu.

brobdingnagians 6 hours ago | parent | prev | next [-]

I tend to run it on High and then step it up for problems where I'm noticing it struggles, bump it back down after. Sometimes I accidentally leave a session in Ultracode for a day and wonder why things are taking so long, but generally happy with the results.
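A minimal sketch of that step-up/step-down habit, written so the drop back to the baseline happens automatically instead of relying on memory. The ladder names are placeholders for illustration, not the settings of any real CLI, and "struggled" is assumed to be a judgment made by the user or by a failing test run.

```python
# Hypothetical effort ladder, lowest to highest; names are placeholders.
LADDER = ["high", "extra", "max"]


def next_effort(current: str, struggled: bool) -> str:
    """Step up one level after a struggling attempt, else return to the baseline."""
    if not struggled:
        return LADDER[0]
    i = LADDER.index(current)
    # Clamp at the top of the ladder instead of running off the end.
    return LADDER[min(i + 1, len(LADDER) - 1)]


effort = LADDER[0]
history = []
# Assumed outcomes of five consecutive requests (True = the model struggled).
for struggled in [False, True, True, True, False]:
    effort = next_effort(effort, struggled)
    history.append(effort)

print(history)  # ['high', 'extra', 'max', 'max', 'high']
```

Resetting to the baseline after a single success is the simplest possible policy; stepping down one level at a time would be an equally reasonable choice.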

sanderjd 9 hours ago | parent | prev | next [-]

What I want is a harness that knows how to optimize this kind of thing for me.

nl 4 hours ago | parent | next [-]

In practice I don't think any harness (happy to be corrected here!) uses the lesser capability models for writing code. The cost trade-offs are rarely worth it.

They are often used for reading code though.

To expand on this, while the "big model to write a plan, small model to write the specific code" idea is quite common it trips up on edge cases.

In theory the flow works like this:

- small fast models read lots of code, and pass details to the large model to write a plan

- large model takes those details and writes a detailed plan

- medium models write the code

The issue happens when the medium model hits something that the plan didn't take into account (which happens a lot - the big model didn't actually read the code). Then it has to either guess, or pass back to the large model.

If it guesses, the plan usually starts to fall to bits.

If it passes back to the large model, inevitably the large model has to start reading lots of code. In that case you are paying for the expensive tokens to read, so you might as well have it write the code too (far fewer tokens are written than are read).
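To put rough numbers on that, here is a back-of-the-envelope sketch. The prices and token counts are invented for illustration and are not any provider's actual rates; the only thing carried over from the argument above is that far more tokens are read than written.

```python
# Hypothetical prices in USD per million tokens; NOT real provider rates.
PRICES = {
    "large": {"input": 15.0, "output": 75.0},
    "medium": {"input": 3.0, "output": 15.0},
}


def cost(model: str, tokens_read: int, tokens_written: int) -> float:
    """Dollar cost of one call given tokens read (input) and written (output)."""
    p = PRICES[model]
    return (tokens_read * p["input"] + tokens_written * p["output"]) / 1_000_000


# Assumed workload: 200k tokens of code read, 5k tokens of code written.
READ, WRITTEN = 200_000, 5_000

# Hybrid after a hand-back: the large model reads the code to fix the plan,
# then the medium model reads it again and writes the code.
hybrid = cost("large", READ, 0) + cost("medium", READ, WRITTEN)

# Large only: the large model reads the code once and writes the code itself.
large_only = cost("large", READ, WRITTEN)

print(f"hybrid: {hybrid:.3f} USD, large only: {large_only:.3f} USD")
```

Under these made-up numbers the hand-back path costs more than letting the large model write the code, because the read cost dominates and is paid twice. With a different read/write ratio or price gap the comparison could flip, so treat this as a way to check your own numbers, not as a result.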

It might be possible to get this to work, but I haven't seen anyone who has tried agentic work with frontier models be satisfied with this hybrid setup.

I'd note that Amp (mentioned above) is probably the leader in using multiple providers in a coding agent but still uses frontier models to write code.

sanderjd an hour ago | parent [-]

Great info, thanks!

cunningfatalist 8 hours ago | parent | prev | next [-]

You might want to check out Amp: https://ampcode.com/

sanderjd 6 hours ago | parent [-]

I appreciate the suggestion! But it isn't clear to me, from reading their marketing site, what they bring to the table from this perspective. Can you give me a more targeted pitch?

manojlds 9 hours ago | parent | prev [-]

Which is your own harness and your own evals for your tasks I guess

munk-a 6 hours ago | parent | next [-]

I don't demand a customized compiler for my code even if such a compiler could outperform gcc. There is a lot of value in focusing on correctness to an extreme degree even if the outcome might be suboptimal to something more tailored - a tool with a large customer base can justify more resources going into its maintenance.

sanderjd 8 hours ago | parent | prev [-]

Maybe. But that sounds like a large amount of bespoke work for what seems like a common problem?

manojlds 7 hours ago | parent [-]

I was talking about enterprise agents and then realized the question is more about coding agents.

sanderjd 7 hours ago | parent [-]

Ah I see! Yes, I was talking about a coding harness, not an enterprise agent. I entirely agree with you that your suggestion of driving it via evals is the right thing for that use case!

jimbo808 3 hours ago | parent | prev | next [-]

It's really not that much. It's a bit hard to make sense of it not because it's hard to keep track of, but because they are being deceptive and opaque about what you're actually buying, and the thing you're paying for is different from one day to the next, as they fuck around with the parameters to boost subjective performance during a launch, then quietly degrade the service to cut costs.

jbvlkt 7 hours ago | parent | prev | next [-]

Exactly this is my problem with all AI tools. I want someone else to create working tools for me so I can focus on my product. It is the same with other tools. I do not want to spend huge amounts of energy and time to set up my IDE, operating system or desk layout. I guess it is too early to have that now.

jerojero 6 hours ago | parent [-]

I think that's the whole selling point of lovable?

tash_2s 2 hours ago | parent | prev | next [-]

I also ended up using max effort/reasoning for both coding and general chat. They don't spend too much extra time on simple tasks these days.

throwaway219450 4 hours ago | parent | prev | next [-]

Same advice as ever? We call it context engineering now, but prompt engineering still matters a lot. Most of the failures I run into are unspecified assumptions made by the model that derail the conversation, but usually updating the first prompt fixes it. Opus in my experience is a bit better about checking assumptions, while Sonnet will plow on ahead. An example is mentioning a file that doesn't exist: Sonnet will go ahead and try to grep your entire hard drive for it. Opus will say it's not local and request the path.

I trust neither for general knowledge and I still find Opus giving me answers that are completely BS. But the token spend for Q&A is nothing compared to coding, so I always use Opus + a lot of thinking. For coding, I find Opus to be better value/token but I haven't done any sort of rigorous test.

deadbabe 4 hours ago | parent | prev | next [-]

There are token optimization consultants that can help organizations find the right balance of models for their employees to minimize costs.

j45 5 hours ago | parent | prev | next [-]

Just because it’s hard to keep track of doesn’t mean it’s not relevant.

Playing around to learn the differences is incredibly helpful. Schedule an hour or two for it on one's calendar each week, and save links throughout the week to try out.

paulddraper 6 hours ago | parent | prev | next [-]

It's almost like you want an automatically intelligent choice of your artificial intelligence.

Understandable frankly.

jacooper 9 hours ago | parent | prev [-]

Just use deepswe as a reference point.