| ▲ | prodigycorp a day ago |
This article was more true than not a year ago, but the harnesses are now so far past the simple agent loop that I'd argue this is not even close to an accurate mental model of what Claude Code is doing.
|
| ▲ | qsort a day ago | parent | next [-] |
Obviously modern harnesses have better features, but I wouldn't say that invalidates the mental model. Simpler agents, including very minimal ones with basic tools, aren't that far behind in performance if the underlying model is the same. I'd say it's similar to how a "make your own relational DB" article might feature a basic B-tree with merge joins. Yeah, obviously real engines have sophisticated planners, multiple join methods, Bloom filters, etc., but the underlying mental model is still accurate.
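For concreteness, here is roughly the loop being discussed: a minimal sketch using the Anthropic Python SDK with a single bash tool. The model id and tool schema are illustrative, not Claude Code's actual internals.

    # Minimal agent loop: one model, one tool, run until no more tool calls.
    import subprocess
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

    TOOLS = [{
        "name": "bash",
        "description": "Run a shell command and return its combined output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }]

    def run_agent(task: str) -> str:
        """Loop until the model stops asking for tool calls."""
        messages = [{"role": "user", "content": task}]
        while True:
            resp = client.messages.create(
                model="claude-opus-4-5",   # placeholder model id
                max_tokens=4096,
                tools=TOOLS,
                messages=messages,
            )
            messages.append({"role": "assistant", "content": resp.content})
            if resp.stop_reason != "tool_use":
                # No tool call requested: the text blocks are the final answer.
                return "".join(b.text for b in resp.content if b.type == "text")
            results = []
            for block in resp.content:
                if block.type == "tool_use":
                    out = subprocess.run(block.input["command"], shell=True,
                                         capture_output=True, text=True,
                                         timeout=120)
                    results.append({"type": "tool_result",
                                    "tool_use_id": block.id,
                                    "content": out.stdout + out.stderr})
            messages.append({"role": "user", "content": results})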
| |
| ▲ | prodigycorp a day ago | parent [-] | | You're not wrong, but I still think the harness matters a lot when trying to accurately describe Claude Code. Here's a reframing: if you asked people, "What would you rather work with: today's Claude Code harness with Sonnet 3.7, or the 200-line agentic loop in the article with Opus 4.5?", I suspect many would choose 3.7 with the harness. And if that is true, then I'd say the article is no longer useful for a modern understanding of Claude Code. | | |
| ▲ | aszen a day ago | parent | next [-] | | I don't think so; model improvements far outweigh any harness or tooling. Look at https://github.com/SWE-agent/mini-swe-agent for proof | | |
| ▲ | prodigycorp a day ago | parent [-] | | Yes, but people aren't necessarily choosing CC because they're performance maximalists. They choose it because it has features that make it behave much more nicely as a pair-programming assistant than mini-swe-agent does. There's a reason Cursor poached Boris Cherny and Cat Wu and Anthropic hired them back! | | |
| ▲ | aszen a day ago | parent [-] | | They nailed the UX, I'd say, and the models themselves are a lot better even outside of CC | | |
| ▲ | prodigycorp a day ago | parent [-] | | I don't think I disagree with you about anything; I'm just splitting hairs at this point. |
|
|
| |
| ▲ | rfw300 a day ago | parent | prev | next [-] | | Anyone who would choose 3.7 with a fancy harness has a very poor memory of how dramatically model capabilities have improved between then and now. | | |
| ▲ | prodigycorp a day ago | parent [-] | | I'd be very interested in the performance of 3.7 decked out with web search, Context7, a full suite of skills, and code-quality hooks against Opus 4.5 with none of those. I suspect it's closer than you think! | | |
| ▲ | CuriouslyC a day ago | parent | next [-] | | Skills don't make any difference beyond having markdown files with instructions that you point an agent at as needed. Context7 isn't any better than telling your agent to use trafilatura to scrape web docs for your libs, and having a linting/static-analysis suite isn't a harness thing. 3.7 was kinda dumb: it was good at vibe UIs but really bad at a lot of things, and it would lie and reward-hack a LOT. The difference with Opus 4.5 is that when you go off the Claude happy path, it holds together pretty well. With Sonnet (particularly <=4), if you went off the happy path things got bad in a hurry. | | |
| ▲ | prodigycorp a day ago | parent [-] | | Yeah, 3.7 was pretty bad. I remember its warts vividly; it wanted to refactor everything. Not a great model on which to hinge this provocation. But skills do improve model performance: OpenAI posted some examples of how they massively juiced up their results on some benchmarks. |
| |
| ▲ | nl a day ago | parent | prev [-] | | > I suspect it's closer than you think! It's not. I've done this (although not with all these tools). For a reasonably sized project it's easy to tell the difference in quality between, say, Grok-4.1-Fast (30 on the AA Coding Index) and Sonnet 4.5 (37 on AA). Sonnet 3.7 scores 27. No way I'm touching that. Opus 4.5 scores 46, and it's easy to see that difference. Give the models something with high cyclomatic complexity or complex dependency chains and Grok-4.1-Fast falls to bits, while Opus 4.5 solves things. |
|
| |
| ▲ | nl a day ago | parent | prev [-] | | This is SO wrong. I actually wrote my own simple agent (with some twists) in part so I could compare models. Opus 4.5 is in a completely different league to Sonnet 4.5, and 3.7 isn't even on the same planet. I happily use my agent with Opus but there is no world in which I'd use a Sonnet 3.7 level model for anything beyond simple code completion. |
|
|
|
| ▲ | alright2565 a day ago | parent | prev | next [-] |
But does that extra complexity actually improve performance? https://www.tbench.ai/leaderboard/terminal-bench/2.0 says yes, but not by as much as you'd think. "Terminus" is basically just a tmux session and an LLM in a loop.
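Roughly like this; a sketch of the pattern rather than Terminus's actual code, with a placeholder model id (real harnesses add waiting and timeout logic around the terminal):

    # The model emits keystrokes; we send them to a tmux pane and feed the
    # captured screen back. No tools here: the terminal IS the tool.
    import subprocess
    import time
    import anthropic

    client = anthropic.Anthropic()

    def tmux(*args: str) -> str:
        return subprocess.run(["tmux", *args],
                              capture_output=True, text=True).stdout

    def query_model(prompt: str) -> str:
        resp = client.messages.create(
            model="claude-opus-4-5",   # placeholder id
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}])
        return resp.content[0].text

    tmux("new-session", "-d", "-s", "agent")   # one detached terminal
    history = ["Task: run the test suite and fix any failures. "
               "Reply with the exact keys to type, or DONE when finished."]

    for _ in range(50):                        # hard cap on steps
        keys = query_model("\n".join(history)).strip()
        if keys == "DONE":
            break
        tmux("send-keys", "-t", "agent", keys, "Enter")
        time.sleep(2)                          # crude: let the command run
        screen = tmux("capture-pane", "-t", "agent", "-p")
        history.append(f"You typed: {keys}\nScreen now:\n{screen}")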
| |
| ▲ | prodigycorp a day ago | parent [-] | | I'm not a good representative for Claude Code because I'm primarily a Codex user now, but I know that if Codex had subagents it would be at least twice as productive. Time spent is an important aspect of performance, so yup, the complexity improved performance. | | |
| ▲ | nyellin a day ago | parent | next [-] | | Not necessarily true. Subagents allow for parallelization, but they can decrease accuracy dramatically if you're not careful, because there are often dependencies between tasks and swapping context windows with a summary is extremely lossy. For the longest time, Claude Code itself didn't really use subagents much by default, other than supporting them as a feature eager users could configure. (Source: reverse engineering we did on Claude Code using the fantastic CC tracing tool Simon Willison wrote about once. This is also no longer true in the latest versions, which have e.g. an Explore subagent that is actively used.) | | |
| ▲ | prodigycorp a day ago | parent [-] | | You're right that subagents were more likely to cause issues than be helpful. But, when properly understood, they lead to a lot of time saved through parallelization on tasks that warrant it. I was having Codex organize my TV/movie library the other day; most of the files were not properly labeled. I had Codex generate transcripts, manually search the movie DB for descriptions of show episodes, and match those descriptions against the transcripts to figure out which season/episode each file was. Claude Code could have parallelized those manual checks and finished the task at 8x the speed. |
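The fan-out itself is nothing exotic; as a sketch, assuming the run_agent() loop from upthread and made-up filenames:

    # Independent lookups run as parallel subagent calls.
    from concurrent.futures import ThreadPoolExecutor

    files = ["unknown_ep_01.mkv", "unknown_ep_02.mkv", "unknown_ep_03.mkv"]
    tasks = [f"Transcribe {f}, search the episode descriptions, and report "
             f"which season/episode it is." for f in files]

    # 8 checks in flight at once is roughly the 8x speedup claimed above.
    with ThreadPoolExecutor(max_workers=8) as pool:
        matches = list(pool.map(run_agent, tasks))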
| |
| ▲ | terminalshort a day ago | parent | prev [-] | | Are subagents a fundamental change, or just inner loops within the agentic loop, similar to the one in the article? | | |
| ▲ | steveklabnik a day ago | parent [-] | | Subagents, in my understanding, are just a tool call from the perspective of the parent. |
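Concretely, something like the following sketch, where spawn_subagent is an illustrative tool name (not necessarily Claude Code's real one) and run_agent() is the plain loop from upthread:

    # From the parent's perspective, a subagent is one more tool definition
    # plus a handler that runs a second copy of the same loop.
    SUBAGENT_TOOL = {
        "name": "spawn_subagent",   # illustrative name
        "description": "Delegate a self-contained task; returns a summary.",
        "input_schema": {
            "type": "object",
            "properties": {"task": {"type": "string"}},
            "required": ["task"],
        },
    }

    def handle_tool_call(block) -> str:
        if block.name == "spawn_subagent":
            # The child starts from a blank history and the parent sees only
            # the child's final answer: that summary boundary is exactly the
            # lossiness nyellin describes above.
            return run_agent(block.input["task"])
        raise ValueError(f"unknown tool: {block.name}")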
|
|
|
|
| ▲ | lukan a day ago | parent | prev | next [-] |
The article was also published a year ago, in January 2025. (Should it have 2025 in the title? Time flies.)
| |
| ▲ | llmslave2 a day ago | parent [-] | | Claude Code didn't exist in January 2025. I think it's a typo and should be 2026. | | |
| ▲ | prodigycorp a day ago | parent [-] | | You're right. No wonder the date felt odd. IIRC Claude Code was released around March. | |
|
|
|
| ▲ | CuriouslyC a day ago | parent | prev | next [-] |
Less true than you think. A lot of the progress in the last year has been tightening agentic prompts/tools and getting out of the way so the model can flex. Subagents/MCP/Skills are all pretty mid, and while there has been some context-pruning optimization to avoid carrying tool output along forever, that mainly benefits long-running agents; on short tasks you won't notice.
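That pruning is simple in principle. A sketch, assuming the Anthropic-style message list from the loop upthread; the cutoffs are arbitrary:

    # Keep the last few tool results verbatim; truncate older ones so the
    # transcript stays intact but stops growing without bound.
    KEEP_RECENT = 3
    MAX_OLD_CHARS = 500

    def prune_tool_results(messages: list) -> None:
        seen = 0
        for msg in reversed(messages):   # walk newest-first
            if msg["role"] != "user" or not isinstance(msg["content"], list):
                continue
            for block in msg["content"]:
                if isinstance(block, dict) and block.get("type") == "tool_result":
                    seen += 1
                    text = str(block.get("content", ""))
                    if seen > KEEP_RECENT and len(text) > MAX_OLD_CHARS:
                        block["content"] = text[:MAX_OLD_CHARS] + " [truncated]"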
| |
| ▲ | prodigycorp a day ago | parent [-] | | All of the things you mentioned are relegated to a footnote of the article. |
|
|
| ▲ | dkdcio a day ago | parent | prev | next [-] |
it seems to have changed a ton in recent versions too; I would love more details on what exactly. it now does things on its own that, until recently, I had to interrupt and tell it to do fairly frequently
| |
| ▲ | terminalshort a day ago | parent [-] | | For one thing, it seems to be splitting up the work, making some determination of complexity, and then allocating each piece to a model based on that complexity to save resources. When I run Claude with Opus 4.5 and run /cost, I see tokens for Opus 4.5, but also a lot in Sonnet and Haiku, with the majority of tokens actually being used by Haiku. | |
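If that's right, the routing could be as simple as a cheap classifier pass. A purely speculative sketch; the tiers, prompt, and model ids are placeholders, not Claude Code's actual policy:

    import anthropic

    client = anthropic.Anthropic()
    MODEL_BY_TIER = {"trivial": "claude-haiku-4-5",    # placeholder ids
                     "routine": "claude-sonnet-4-5",
                     "hard": "claude-opus-4-5"}

    def pick_model(task: str) -> str:
        # Grade difficulty with the cheapest model, then dispatch accordingly.
        resp = client.messages.create(
            model=MODEL_BY_TIER["trivial"],
            max_tokens=5,
            messages=[{"role": "user", "content":
                       "Answer with one word (trivial, routine, or hard). "
                       f"How complex is this coding task?\n{task}"}])
        tier = resp.content[0].text.strip().lower()
        return MODEL_BY_TIER.get(tier, MODEL_BY_TIER["hard"])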
| ▲ | nyellin a day ago | parent [-] | | Haiku is called often, but not always in the way you'd think. E.g. every time you write something, CC invokes Haiku multiple times to generate the 'delightful 1-2 word phrase used to indicate progress to the user' (Doing Stuff, Wizarding, etc.). | | |
| ▲ | dkdcio a day ago | parent [-] | | it’s also used in the Explore agent and for other things too |
|
|
|
|
| ▲ | pama a day ago | parent | prev | next [-] |
Agreed. You can get a better mental model by reading the codex-cli repo and having an agent help you analyze the core functionality.
|
| ▲ | splike a day ago | parent | prev [-] |
I'm interested; could you expand on that?
| |
| ▲ | prodigycorp a day ago | parent [-] | | Off the top of my head: parallel subagents, hooks, skills, and a much better plan mode. These features enable way better steering than we had last year. Subagents are a huge boon to productivity. | | |
|