That's a great write-up.

The one thing I feel it underestimates is the likelihood of improvement. Even the authors acknowledge it's not worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year, 8 months ago, as the first time agentic coding became broadly viable, even with frontier hosted models.

So why would we lock in, at this point, on any fixed idea of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long-horizon tasks with models that run on consumer or pro-grade hardware. But so far the naive optimists are winning.

sanderjd 6 hours ago

Right. Opus 4.5, 8 months ago, was good enough for agentic coding. How far behind that are open-weight models? More than 8 months? But how much more? When will they reach Opus 4.5 level? A few months from now? A year from now? Never?

theshrike79 4 hours ago

The power of Opus isn't just the model, it's in the harness too.

You can try it by using Opus through GitHub Copilot vs. the official Anthropic tools. You'll get very different results and a very different experience (in my opinion).

larsnystrom 4 hours ago

I’ve only used Opus in GitHub Copilot and was hugely underwhelmed. It was barely usable. Are you saying it’s better with the official Anthropic tools?

	theshrike79 2 hours ago
		Night and day, in my opinion. But this is all purely feels, so YMMV. I especially like how the Claude Code CLI communicates its progress, something they hide a lot more in the desktop app, for example.
	m-ee 4 hours ago
		I don't know about better, but it's certainly different. It's painfully slow through the Claude Code VS Code extension compared to Copilot, but maybe "smarter": I feel like I have to correct it less, using Sonnet on both. I don't use Opus much because of the cost, but coworkers say the difference between harnesses there is also pronounced.

throwa356262 2 hours ago

Open-source harnesses are also improving rapidly.

Some people would claim they are already far better than Claude Code and Codex.

theplumber 6 hours ago

I think we will have Opus 4.5 performance in open models within the next 6 months. We are very close.

	krzyk an hour ago
		We first need to reach the level of Sonnet 4.x; we aren't there yet.

marak830 5 hours ago

GLM 5.2 came out today and the early reports have been quite good. It's very difficult to run except on prosumer hardware, but a small business could run it quite easily (or use something like OpenRouter).

3abiton 3 hours ago

And a big thing that's missing is ... the harness comparison. It plays a very big role. I use forge, and I have been impressed with what it can do given all the limitations of local models.

rippeltippel 6 hours ago

Since the author is referring to a specific model, I think it makes sense to ignore how the model (or local models in general) may improve over time.

It's like buying a car: I drive that car and get attuned to its characteristics; I don't think about how that car (or similar cars) may improve. That's my tool and I want to make the most of it.

It is true that switching local models is technically very cheap, but there's a considerable time investment in squeezing the most out of one, and that effort may not carry over to a newer version of that model.

appplication 6 hours ago

Agree 100%, even on Claude 4.5 being the turning point for agentic coding. It completely turned me around on it.