logicprog 9 hours ago

I really enjoyed this article. I think the author is precisely right, and I've been saying this for a long time. There's a ton of extremely interesting low-hanging fruit hiding in how we design our agent harnesses that can vastly improve the effectiveness of even currently existing models; enough to — at least until we hit diminishing returns — make as much of a difference as training new models, or more!

I think one of the things this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. If improving the harness can make as much of a difference as improving the model itself, if not more, then the two have to be considered equally important. Not to mention that models are specifically reinforcement-learned to use harnesses, and harnesses are adapted to the needs of models in general or of specific models, so the two necessarily develop together in a feedback loop. And in practice, as they operate, it's a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.

I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network. Once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.

andai 5 hours ago | parent | next [-]

My Weird Hill is that we should be building things with GPT-4.

I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!

I say this because I did!

Making things work with older models forces you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.

Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!

Well, surely none of this is relevant today? Well, it turns out all of it still is! E.g., as a small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
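
For concreteness, here's a rough sketch of that kind of startup map in Python (purely illustrative; the script and its names are mine, and the equivalent shell one-liner works just as well). Run it at session start and drop the output into the agent's context:

    # codebase_map.py -- a minimal sketch of the "grep def" idea.
    # Prints a file -> definitions map to inject into the agent's
    # context at startup (e.g. via a session-start hook).
    import os

    def codebase_map(root=".", exts=(".py",)):
        for dirpath, dirnames, filenames in os.walk(root):
            # Skip directories that would only add noise.
            dirnames[:] = [d for d in dirnames if d not in {".git", "node_modules", ".venv"}]
            for name in filenames:
                if name.endswith(exts):
                    path = os.path.join(dirpath, name)
                    with open(path, encoding="utf-8", errors="replace") as f:
                        for lineno, line in enumerate(f, 1):
                            if line.lstrip().startswith(("def ", "class ")):
                                print(f"{path}:{lineno}: {line.strip()}")

    if __name__ == "__main__":
        codebase_map()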

-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)

jstummbillig 3 hours ago | parent | next [-]

The problem with these exercises is always: I have limited time and capacity to do things, and a fairly unlimited number of problems that I can think of to solve. Coding is not a problem I want to solve. Prompt engineering is not a problem I want to solve.

If I do things for the love of it, the rules are different, of course. But otherwise, I simply accept that there are many things improving around me that I have no intimate knowledge of and probably never will; I let other people work them out, and happily lean on their work to do the next thing I care about that isn't already solved.

logicprog 4 hours ago | parent | prev [-]

> Well, surely none of this is relevant today? Well, it turns out all of it still is! E.g., as a small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)

Hahaha, yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. It's just another example of the kind of low-hanging fruit harnesses are leaving on the table. A version of this that uses tree-sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
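
Something like this, say (a hedged sketch, assuming the tree-sitter and tree-sitter-python PyPI packages; their Python API has shifted between versions, so the details may need adjusting):

    # treesitter_map.py -- hypothetical sketch of a parser-based codebase map.
    import pathlib
    import tree_sitter_python
    from tree_sitter import Language, Parser

    parser = Parser(Language(tree_sitter_python.language()))

    def outline(path):
        tree = parser.parse(pathlib.Path(path).read_bytes())
        stack = [tree.root_node]
        while stack:
            node = stack.pop()
            if node.type in ("function_definition", "class_definition"):
                name = node.child_by_field_name("name")
                if name is not None:
                    print(f"{path}:{node.start_point[0] + 1}: {name.text.decode()}")
            stack.extend(reversed(node.children))

    # A harness would run this over the repo on every agent startup.
    for f in pathlib.Path(".").rglob("*.py"):
        outline(f)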

> My Weird Hill is that we should be building things with GPT-4.

I disagree, IMO using the best models we have is a good way to avoid wasting time, but that doesn't mean we shouldn't also be frugal and clever with our harnesses!

andai 4 hours ago | parent [-]

To clarify, I didn't mean we should be using ancient models in production, I meant in R&D.

Anthropic says "do the simplest thing that works." If it works with the LLMs we had 3 years ago, doesn't that make it simpler?

The newer LLMs mostly seem to work around the poor system design. (Like spawning 50 subagents on a grep-spree because you forgot to tell it where anything is...) But then you get poor design in prod!

codazoda an hour ago | parent | prev | next [-]

I've been working on Peen, a CLI that lets local Ollama models call tools effectively. It's quite amateur, but I've been surprised how spending a few hours on prompting, and on code to handle responses, can improve the outputs of small local models.

https://github.com/codazoda/peen

JSR_FDED 12 minutes ago | parent | prev [-]

Very cool. Love to see more being squeezed from smaller models.

mycall 9 hours ago | parent | prev | next [-]

If I remember correctly, both the Claude Code and OpenAI Codex harnesses have now improved themselves.

OpenAI used early versions of GPT-5.3-Codex to debug its own training process, manage its deployment and scaling, and diagnose test results and evaluation data.

The Claude Code team has shipped 22 PRs in a single day, and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.

logicprog 9 hours ago | parent | prev | next [-]

Also, yes, I'm aware that I use a lot of "it's not just X, it's Y." I promise you this comment is entirely human-written. I'm just really tired, and tend to rely on more rote rhetorical tropes when I am. Believe me, I wrote like this long before LLMs were a thing.

rubenflamshep 9 hours ago | parent | next [-]

It didn’t read as AI to me :)

drob518 6 hours ago | parent | prev | next [-]

That's what all the AIs have been trained to say.

co_king_3 6 hours ago | parent | prev | next [-]

No one here will accuse you of being an AI unless they're trying to dehumanize you for expressing anti-AI sentiment.

logicprog 4 hours ago | parent [-]

I'm sorry, but that's empirically false. E.g., a substantial proportion of the highly upvoted comments on https://news.ycombinator.com/item?id=46953491 (one of the best articles on software engineering I've read in a long time) accuse it of being AI for no reason.

kachapopopow 9 hours ago | parent | prev [-]

why the long -'s

logicprog 9 hours ago | parent [-]

Because I like them?

kachapopopow 8 hours ago | parent | next [-]

Reminds me of that one guy complaining that everyone keeps calling him an AI, when the AI was trained on his grammar style.

ahofmann 8 hours ago | parent [-]

This happened to a speaker with her voice, which I find terrifying: https://www.youtube.com/watch?v=qO0WvudbO04

soperj 8 hours ago | parent | prev [-]

how do you make them?

RussianCow 7 hours ago | parent [-]

On macOS, Option+Shift+- and Option+- insert an em dash (—) and en dash (–), respectively. On Linux, you can hit the Compose Key and type --- (three hyphens) to get an em dash, or --. (hyphen hyphen period) for an en dash. Windows has some dumb incantation that you'll never remember.

oblio 3 hours ago | parent | next [-]

For Windows it's just easier to make a custom keyboard layout and go to town with that: https://www.microsoft.com/en-us/download/details.aspx?id=102...

BizarroLand 5 hours ago | parent | prev [-]

Alt+0151 or WIN+SHIFT+-, but I can't seem to make the WIN+SHIFT+- combo work in browser, only in a text editor.

noupdates 6 hours ago | parent | prev | next [-]

I was just looking at the SWE-bench docs, and it seems like they use an almost arbitrary form of context engineering (loading in some arbitrary number of files to saturate the context). So in a way, the bench suites test how good a model is with little to no context engineering (I know ... it doesn't need to be said). We may not actually know which models are sensitive to good context engineering; we're simply assuming all models are. I absolutely agree with you on one thing: there is definitely a ton of low-hanging fruit.

barrenko 9 hours ago | parent | prev | next [-]

2026 is the year of the harness.

visarga 8 hours ago | parent | next [-]

I already made a harness for Claude that keeps read/write plans, rather than the write-once plans they're usually implemented as: it can modify them as it works through the task at hand. It also relies on a collection of patterns for writing coding-task plans, which evolves by reflection. Everything is designed so I can run Claude in yolo mode in a sandbox for long stretches of time.
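
The shape of it, as a toy sketch (not my actual harness code):

    # The plan is a plain file the agent can both read and rewrite
    # between steps, instead of a write-once artifact.
    from pathlib import Path

    PLAN = Path("PLAN.md")

    def read_plan() -> str:
        """Tool: return the current plan so the agent can re-orient."""
        return PLAN.read_text() if PLAN.exists() else "(no plan yet)"

    def write_plan(new_plan: str) -> str:
        """Tool: let the agent revise the plan as its understanding evolves."""
        PLAN.write_text(new_plan)
        return "plan updated"

    # Register both as tools; the loop becomes:
    # read_plan -> do one step -> write_plan (revised) -> repeat.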

porker 3 hours ago | parent [-]

Link?

ex-aws-dude 5 hours ago | parent | prev | next [-]

As a VC in 2026 I'm going to be asking every company "but what's your harness strategy?"

kridsdale3 2 hours ago | parent [-]

Given that you're likely in San Francisco, make sure you say "AI Harness".

JSR_FDED 7 minutes ago | parent [-]

It’s all about user-specific bindings.

cyanydeez 3 hours ago | parent | prev | next [-]

2027 is the year of the "maybe indeterminism isn't as valuable as we thought"

miohtama 8 hours ago | parent | prev [-]

But will the harness build desktop Linux for us?

vidarh 4 hours ago | parent | next [-]

My harness is improving my Linux desktop...

riskable 7 hours ago | parent | prev [-]

Only if you put bells on it and sing Jingle Bells while it em dashes through the snow.

aeon_ai 9 hours ago | parent | prev | next [-]

Once you begin to see the “model” as only part of the stack, you begin to realize that you can draw the line of the system to include the user as well.

That’s when the future really starts hitting you.

renato_shira 5 hours ago | parent | next [-]

Yeah, this clicked for me when I stopped obsessing over which model to use and focused on how I structure the context and feedback loops around it. For my project, the same model went from "barely usable" to "legitimately helpful" just by changing how I fed it context and how I validated its output.

The user-inclusion part is real too. The best results I get aren't from fully autonomous agents; they're from tight human-in-the-loop cycles where I'm steering in real time. The model does the heavy lifting; I do the architectural decisions and error correction. It feels more like pair programming than automation.

logicprog 4 hours ago | parent [-]

> The user-inclusion part is real too. The best results I get aren't from fully autonomous agents; they're from tight human-in-the-loop cycles where I'm steering in real time. The model does the heavy lifting; I do the architectural decisions and error correction. It feels more like pair programming than automation.

Precisely. This is why I use Zed and the Zed Agent. It's near-unparalleled for live, mind-meld pair programming with an agent, thanks to CRDTs, DeltaDB, etc. I can elaborate if anyone is interested.

ambicapter 4 hours ago | parent | next [-]

I am interested.

rahabash 4 hours ago | parent | prev [-]

plz do

logicprog 4 hours ago | parent [-]

The special (or at least new-to-me) things about Zed (when you use it with the built-in agent, instead of one of the ones available through ACP) basically boil down to the fact that it's a hyper-advanced CRDT-based collaborative editor, meant for live pair programming in the same file, so it can just treat agents like another collaborator. (There's a toy sketch of the resulting behavior after the list.)

1. The diffs from the agent just show up in the regular file you were editing; you're not forced to use a special completion model, or to view the changes in a special temporary staging mode or a different window.

2. You can continue to edit the exact same source code without accepting or rejecting the changes, even in the same places, and nothing breaks — the diffs still look right, and accepting or rejecting Just Works afterwards.

3. You can accept or reject changes piecemeal, and the model doesn't get confused by this at all; it doesn't have to go "oh wait, the file was/wasn't changed, let me re-read..." or whatever.

4. Even though you haven't accepted the changes, the model can continue to make new ones, since they're stored as branches in the CRDT. So you can have it iterate on its suggestions before you accept them, without forcing it to start completely over (it sees the file as if its changes were accepted).

5. Moreover, the actual files on disk are in the state it suggests, meaning you can compile, fuzz, test, run, etc. to see what its proposed changes do before accepting them.

6. You can click a follow button and see which files it has open, where it's looking in them, and watch as it edits the text, like you're following a dude in Dwarf Fortress. This means you can very quickly know what it's working on and when, correct it, or hop in to work on the same file it is.

7. It can actually go back and edit the same place multiple times as part of a thinking chain, or even as part of the same edit. That has some pretty cool implications for final code quality, because it can iterate on its suggestion before you accept it (see also point 9 below).

8. It streams its code diffs, instead of hanging and then producing them as a single gigantic tool call. Seeing it edit the text live, instead of having to wait for a final complete diff that you either accept or reject, is a huge boon for iteration time compared to e.g. Claude Code, because you can stop and correct it midway, and also read along as it goes, so you're more in lockstep with what's happening.

9. Crucially, because the text it's suggesting is actually in the buffer at all times, you can see LSP, tree-sitter, and linter feedback, all inline and live as it writes code; and as soon as it's done with an edit, it can see those diagnostics too — so it can actually iterate with feedback while it's in the middle of a series of changes, instead of you having to accept the whole diff to see what the LSP says.
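
To make points 2 through 5 concrete, here's a toy model of the user-visible behavior (emphatically not Zed's actual CRDT internals):

    # Toy model: pending agent edits live alongside the accepted text; the
    # "on disk" view applies them all, and you accept or reject hunk by hunk.
    class Buffer:
        def __init__(self, text: str):
            self.accepted = text   # what the user has accepted so far
            self.pending = []      # agent hunks: (old_substring, new_substring)

        def propose(self, old: str, new: str):
            """Agent suggests a change; nothing is accepted yet."""
            self.pending.append((old, new))

        def on_disk(self) -> str:
            """What compilers/tests see: the file as if every hunk were accepted."""
            text = self.accepted
            for old, new in self.pending:
                text = text.replace(old, new, 1)
            return text

        def accept(self, i: int):
            """Accept one hunk piecemeal; the rest stay pending."""
            old, new = self.pending.pop(i)
            self.accepted = self.accepted.replace(old, new, 1)

        def reject(self, i: int):
            self.pending.pop(i)

    buf = Buffer("def add(a, b):\n    return a - b\n")
    buf.propose("a - b", "a + b")
    assert "a + b" in buf.on_disk()  # runnable before accepting (point 5)
    buf.accept(0)                    # hunk-by-hunk accept/reject (point 3)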

logicprog 9 hours ago | parent | prev | next [-]

Aha! A true cybernetics enthusiast. I didn't say that because I didn't want to scare people off ;)

drob518 6 hours ago | parent | prev [-]

That's next-year's problem.

fazgha 9 hours ago | parent | prev [-]

So deep, your comment. Asking for a friend: how did you manage to get the em dash — on your keyboard?

throwup238 8 hours ago | parent | next [-]

Does your friend have an iPhone? The default iOS keyboard has automatically converted double dashes into an em dash for at least seven years now.

QuercusMax 5 hours ago | parent [-]

I think Google Docs does this too, which drives me up the wall when I'm trying to write `command --foo=bar` and it turns it into an em dash, which obviously doesn't work.

velcrovan 9 hours ago | parent | prev | next [-]

https://joeldueck.com/manually-type-punctuation.html

https://joeldueck.com/ai-is-right-about-em-dashes.html

ahofmann 9 hours ago | parent | prev | next [-]

Em dashes are used often by LLMs because humans use them often, and on Mac keyboards they're easily typed. I know this is oversimplifying the situation, but I don't see the usefulness of the constant witch-hunting for allegedly LLM-generated text. For text, we are long past the point where we can differentiate between human-generated and machine-generated. We're even at the point where it's getting somewhat hard to identify machine-generated audio and visuals.

StilesCrisis 7 hours ago | parent | next [-]

I might not be able to spot ALL AI generated text, but I can definitely spot some. It's still kind of quirky.

vardalab 6 hours ago | parent | prev [-]

Yeah, I agree with you. I'm so tired of people complaining about AI-generated text without focusing on the content. Just don't read it if you don't like it. It's like another version of people complaining that a website is not readable for them, or that some CSS rendering is wrong, or whatever. How does it add to the discussion?

ink 9 hours ago | parent | prev | next [-]

On a Mac, it's alt-dash in case you weren't being facetious

snazz 9 hours ago | parent | next [-]

Extra pedantic: that’s the en dash, the em dash is option-shift-hyphen

macintux 9 hours ago | parent | prev [-]

Technically option-shift-dash; option-dash is an en dash.

vient 4 hours ago | parent | prev | next [-]

On Windows it is Alt+0151. Harder to use than on a Mac, but definitely possible; I use it frequently.

On recent versions, Shift+Win+- also works, and Win+- produces an en dash.

wiredfool 4 hours ago | parent | prev | next [-]

I just type -- and jira fixes it.

dolebirchwood 5 hours ago | parent | prev | next [-]

I really despise that people like you ruined em dashes for the rest of us who have enjoyed using them.

bitwize 8 hours ago | parent | prev [-]

I use Compose - - - on Linux and my cellphone (Unexpected Keyboard). Mac is Alt-_.