anitil 2 days ago

On the latest episode of 'Security Cryptography Whatever' [0], they mention that time spent improving the harness (at the moment) ends up being outperformed by the strategy of "wait for the next model". I doubt that will continue, but it broke my intuition about how to improve them

[0] https://securitycryptographywhatever.com/2026/03/25/ai-bug-f...

conception 2 days ago | parent | next [-]

This is basically how you should treat all AI dev. Working around AI model limits on something that will take 3-6 months of work has very little ROI compared to building what works today, then waiting and building what works tomorrow, tomorrow.

sally_glance 2 days ago | parent | next [-]

This is the hard part - especially with larger initiatives, it takes quite a bit of work to evaluate what the current combination of harness + LLM is good at. Running experiments yourself is cumbersome and expensive, and public benchmarks are flawed. I wish providers would release at least a set of blessed example trajectories alongside new models.

As it is, we're stuck with "yeah it seems this works well for bootstrapping a Next.js UI"...

thephyber 2 days ago | parent | prev [-]

This assumes AI model improvements will be predictable, which they won’t.

There are several simultaneous moving targets: the different models available at any point in time, model complexity/capability, price per token, the number of tokens a model uses for a given query, context-size capabilities and prices, and even the evolution of the codebase. You can’t calculate comparative ROIs for model A today versus model B next year unless these are far more predictable than they currently are.

jorvi 2 days ago | parent | prev | next [-]

That seems very unlikely.

Chinese AI vendors specifically pointed out that even a few generations ago there was maybe 5-15% more capability to squeeze out via training, but that the cost of doing so is prohibitive, and only US vendors have the capex for enough compute to cover both inference and that level of training.

I'd take their word over that of someone who has a vested interest in pushing Anthropic's latest and greatest.

The real improvements are going to be in tooling and harnessing.

anitil a day ago | parent [-]

> The real improvements are going to be in tooling and harnessing

I don't have any special knowledge here, but the guy on the podcast (who worked/works with one of the big AI firms) is the one who made the claim. In the future, when (if?) the speed of development slows, I agree it would no longer be true

theptip 2 days ago | parent | prev | next [-]

It’s a good thing to keep in mind, but LLM + scaffolding is clearly superior. So if you just use vanilla LLMs you will always be behind.

I think the important thing is to avoid over-optimizing your scaffold, not to avoid building one altogether.

fragmede 2 days ago | parent [-]

It's wild to me that a paragraph or 7 of plain English that amounts to "be good at things" is enough to make a material difference in the LLM's performance.

l33tman 2 days ago | parent | next [-]

Since the base is an auto-regressive model capable of generating more or less any kind of text, it kind of makes sense. It always has the capabilities, but it could just as well emulate a poor analysis. So you're leading in with text that describes, in a pretty real sense, what the rest of the text will be.

jtbayly 2 days ago | parent | prev | next [-]

I read once (so no idea if it is true) that in voice lessons, one of the most effective things you can do to improve people's technique is to tell them to pretend to be an opera singer.

chrisjj 2 days ago | parent | prev | next [-]

There will always be bosses who/which think telling workers to work well works well.

AlexCoventry 2 days ago | parent | prev [-]

They have no values of their own, so you have to direct their attention that way.

yorwba 2 days ago | parent | prev | next [-]

I think you took away the wrong lesson from that podcast:

> I think there is work to be done on scaffolding the models better. This exponential right now reminds me of the exponential in CPU speeds going up until, let's say, 2000 or something, where you had these game developers who would develop really impressive games on the current generation of hardware, and they'd do it by writing really detailed, intricate x86 instruction sequences for exactly what, you know, a 486 can do, knowing full well that in 2 years the Pentium is gonna be able to do this much faster and they didn't need to do it. But you need to do it now because you wanna sell your game today, and, like, yeah, you can't just wait and have everyone be able to do this. And so I do think that there definitely is value in squeezing out all of the last little juice that you can from the current model.

Everything you can do today will eventually be obsoleted by some future technology, but if you need better results today, you actually have to do the work. If you just drop everything and wait for the singularity, you're just going to unnecessarily cap your potential in the meantime.

argee 2 days ago | parent | prev | next [-]

> it broke my intuition about how to improve them

Here we go again.

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Kinrany 2 days ago | parent | prev | next [-]

That only applies to workarounds for current limitations, no? Some things a harness can do will apply in the same way to future models.

bitexploder 2 days ago | parent | prev [-]

And if you have the better harness and the next model?

anitil 2 days ago | parent [-]

I would _hope_ that the double combo would be better, but honestly I have no idea

bitexploder 2 days ago | parent [-]

I do. It is better. I have done a lot of vuln research. I can get way better than one-shot-level results out of “inferior” models.