Aurornis 6 hours ago

> It's not really clear whether Opus 4.5+ represent a level shift on this frontier or just inhabits place on that curve which delivers higher performance, but at rapidly diminishing returns to inference cost.

I think we're reaching the point where more developers need to start right-sizing the model and effort level to the task. It was easy to get comfortable with using the best model at the highest setting for everything for a while, but as the models continue to scale and reasoning token budgets grow, that's no longer a safe default unless you have unlimited budgets.

I welcome the idea of having multiple points on this curve that I can choose from, depending on the task. I'd also welcome an even larger model that I could pull out for complex and important tasks, even if I had to let it run for 60 minutes in the background and it made my entire 5-hour token quota disappear in one question.

I know not everyone wants this mental overhead, though. I predict we'll see more attempts at smart routing to different models depending on the task, along with the predictable complaints from everyone when the results are less than predictable.
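To make "smart routing" concrete, here is a minimal sketch of what a rule-based router could look like. Everything in it is an assumption for illustration: the tier names, the `Task` fields, and the thresholds are hypothetical and not taken from any real product's "auto" mode.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    files_touched: int   # rough size estimate (hypothetical feature)
    interactive: bool    # True if a human reviews each step as it happens
    high_stakes: bool    # True if mistakes would be costly to catch later


def route(task: Task) -> str:
    """Pick a model tier from coarse task features.

    The tier names ("small", "medium", "large") and the threshold of
    5 files are made up for this sketch.
    """
    # Unattended or high-stakes work goes to the strongest tier.
    if task.high_stakes or not task.interactive:
        return "large"
    # Larger interactive edits get the middle tier.
    if task.files_touched > 5:
        return "medium"
    # Small, supervised tasks can use the cheapest tier.
    return "small"


print(route(Task("rename a variable", 1, True, False)))          # small
print(route(Task("refactor a module", 12, True, False)))         # medium
print(route(Task("migrate the codebase overnight", 200, False, False)))  # large
```

The hard part is not the `if` statements but estimating the inputs, which is where the "less than predictable" results would come from.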

KronisLV 4 hours ago | parent | next [-]

> It was easy to get comfortable with using the best model at the highest setting for everything for a while, but as the models continue to scale and reasoning token budgets grow, that's no longer a safe default unless you have unlimited budgets.

For a while I used Cerebras Code for 50 USD a month with them running a GLM model and giving you millions of tokens per day. It did a lot of heavy lifting in a software migration I was doing at the time (and made it DOABLE in the first place), BUT there were about 10 different places where the migration got fucked up and had to manually be fixed - files left over after refactoring (what's worse, duplicated ones basically), some constants and routes that are dead code, some development pages that weren't removed when they were superseded by others and so on.

I would say that using Claude Code and throwing Opus at most problems (with Sonnet or Haiku as sub-agents for simple and well-specified tasks) is actually way better, simply because it fucks things up less often and review iterations at least catch when things are going wrong like that. Worse models (and pretty much every one that I can afford to launch locally, even ones that need around 80 GB of VRAM in the context of an org wanting to self-host stuff) will be confidently wrong and place time bombs in your codebase that you won't even be aware of if you don't pay enough attention to everything - even when the task was rote bullshit that any model worth its salt should have resolved with 0 issues.

My fear is that models that would let me truly be as productive as I want with any degree of confidence might be Mythos tier and the economics of that just wouldn't work out.

Aurornis an hour ago | parent [-]

Good points. I was speaking from a position of using an LLM in a pair programming style where I'm interactive with each request.

For handing work off to an LLM in large chunks, picking the best model available is the only way to go right now.

dustingetz 4 hours ago | parent | prev | next [-]

Human dev labor cost is still the high pole in the tent, even multiplying today's subsidized subscription cost by 10x. If the capability improvement trajectory continues, developers should prepare for a new economy where more productivity is achieved by fewer devs by shifting substantial labor budget to AI.

johnmaguire 2 hours ago | parent | next [-]

I'm getting a lot more done by handing off the code writing parts of my tasks to many agents running simultaneously. But my attention still has its limits.

what an hour ago | parent | prev [-]

Your employer doesn’t pay the subscription cost, they pay per token. So it’s already way more than 10x the cost.

richstokes 3 hours ago | parent | prev | next [-]

The problem is half the time you don't know you need the better model until the lesser model has made a massive mess. Then you have to do it again on the good model, wasting money. The "auto" modes don't seem to do a good job at picking a model IME.

dahart 4 hours ago | parent | prev | next [-]

> I know not everyone wants this mental overhead, though.

I’m curious how to even do it. I have no idea how to choose which model to use in advance of a given task, regardless of the mental overhead.

And unless you can predict perfectly what you need, there’s going to be some overuse due to choosing the wrong model and having to redo some work with a better model, I assume?
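A back-of-the-envelope way to reason about that overuse is to compare the expected cost of "try the cheap model first, escalate on failure" against "always use the strong model". The sketch below assumes made-up cost units, a single success probability for the cheap model, and an optional cleanup cost for undoing a failed attempt; none of these numbers come from real pricing.

```python
def expected_cost_cheap_first(cheap_cost, strong_cost, p_cheap_succeeds, cleanup_cost=0.0):
    """Expected cost of trying the cheap model, then redoing on the strong one if it fails.

    All inputs are hypothetical: costs are in arbitrary units and
    p_cheap_succeeds is the chance the cheap model gets it right.
    """
    p_fail = 1.0 - p_cheap_succeeds
    return cheap_cost + p_fail * (strong_cost + cleanup_cost)


def break_even_success_rate(cheap_cost, strong_cost, cleanup_cost=0.0):
    """Cheap-first is cheaper than strong-only when the success rate exceeds this value."""
    return (cheap_cost + cleanup_cost) / (strong_cost + cleanup_cost)


# Cheap model costs 1 unit, strong model costs 5, cheap model succeeds half the time.
print(expected_cost_cheap_first(1.0, 5.0, 0.5))   # 3.5, versus 5.0 for strong-only
print(break_even_success_rate(1.0, 5.0))          # 0.2
# If cleaning up a failed attempt costs 3 units, the bar rises.
print(break_even_success_rate(1.0, 5.0, 3.0))     # 0.5
```

Under these assumptions you do not need to predict perfectly, only better than the break-even rate, but the cleanup term shows why a "massive mess" from the lesser model changes the math quickly.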

Leynos 4 hours ago | parent | prev | next [-]

Isn't that essentially GPT Pro Extended Thinking?

jpalawaga 5 hours ago | parent | prev | next [-]

Except developers can’t even do that. Estimation of any not-small task that hasn’t been done before is essentially a random guess.

nilkn 5 hours ago | parent | next [-]

I don't completely agree. Estimation is nontrivial, but not necessarily a random guess. Teams of human engineers have been doing this for decades -- not always with great success, but better than random. Deciding whether to put an intern or your best staff engineer on a problem is a challenge known to any engineering manager and TPM.

jpalawaga an hour ago | parent [-]

or tech lead. or whoever. the point is, someone has to do the sizing. I think applying an underpowered agent to a task of unknown size is about as good as getting the intern to do it.

Even EMs and TPMs are assigning people based on their previous experience, which generally boils down to "i've seen this task before and I know what's involved," "this task is small, and I know what's involved," or "this task is too big and needs to be understood better."

justapassenger 4 hours ago | parent | prev [-]

That's why you split tasks and do project management 101.

That's how things worked pre-AI, and old problems are new problems again.

When you run any bigger project, you have senior folks who tackle the hardest parts of it, experienced folks who can churn out massive amounts of code, junior folks who target smaller/simpler/better-scoped problems, etc.

We don't default to telling the most senior engineer "you solve all of those problems". But they're often involved in evaluating, scoping down, breaking down the problem, supervising, correcting, etc.

There's tons of analogies and decades of industry experience to apply here.

jpalawaga an hour ago | parent [-]

Yeah... you split tasks into successively smaller tasks until they're estimable.

I'm not saying that can't be done, but breaking down a large task that hasn't been broken down yet needs, you guessed it, a powerful agent. that's your senior engineer who can figure out the rote parts, the medium parts, and the thorny parts.

the goal isn't to have an engineer do that. we should still be throwing powerful agents at a problem, they should just be delegating the work more efficiently.

throwing either an engineer or an agent at any unexplored work means you just have to delegate to the most experienced resource, or suffer the consequences.
