simonw 10 hours ago

> Not only does an agent not have the ability to evolve a specification over a multi-week period as it builds out its lower components, it also makes decisions upfront that it later doesn’t deviate from.

That's your job.

The great thing about coding agents is that you can tell them "change of design: all API interactions need to go through a new single class that does authentication and retries and rate-limit throttling" and... they'll track down dozens or even hundreds of places that need updating and fix them all.
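
To make that concrete: the kind of class I mean looks something like this minimal sketch (the name, retry policy, and httpx usage are all illustrative assumptions, not from any real project):

  import time
  import httpx

  class APIClient:
      # One place for auth, retries, and rate-limit throttling.
      def __init__(self, token, max_retries=3, min_interval=0.5):
          self.token = token
          self.max_retries = max_retries
          self.min_interval = min_interval  # minimum seconds between requests
          self._last_request = 0.0

      def request(self, method, url, **kwargs):
          # Throttling: keep a minimum gap between consecutive requests.
          wait = self.min_interval - (time.monotonic() - self._last_request)
          if wait > 0:
              time.sleep(wait)
          # Authentication: credentials are attached in exactly one place.
          headers = {"Authorization": f"Bearer {self.token}"}
          for attempt in range(self.max_retries):
              self._last_request = time.monotonic()
              response = httpx.request(method, url, headers=headers, **kwargs)
              if response.status_code != 429:  # not rate-limited
                  return response
              time.sleep(2 ** attempt)  # back off before retrying
          return response

Once that class exists, "fix them all" mostly means replacing scattered HTTP calls with client.request(...) - exactly the kind of mechanical, wide-ranging edit agents are good at.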

(And the automated test suite will help them confirm that the refactoring worked properly, because naturally you had them construct an automated test suite when they built those original features, right?)

Going back to typing all of the code yourself (my interpretation of "writing by hand") because you don't have the agent-managerial skills to tell the coding agents how to clean up the mess they made feels short-sighted to me.

disgruntledphd2 10 hours ago

> (And the automated test suite will help them confirm that the refactoring worked properly, because naturally you had them construct an automated test suite when they built those original features, right?)

I dunno, maybe I have high standards, but I generally find that the test suites generated by LLMs are both over- and under-determined. Over-determined in the sense that some of the tests are focused on implementation details, and under-determined in the sense that they don't test the conceptual things that a human might.
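
The contrast I mean looks roughly like this (pytest-mock's mocker fixture is real; the orders module and its helpers are invented for illustration):

  import dataclasses
  import orders  # hypothetical module under test

  @dataclasses.dataclass
  class Item:
      price: float

  @dataclasses.dataclass
  class Order:
      items: list
      total: float = 0.0

  # Over-determined: pins a private helper, so a harmless refactor breaks it.
  def test_discount_calls_recalculate(mocker):
      order = Order(items=[Item(price=100)])
      spy = mocker.spy(orders, "_recalculate_totals")  # implementation detail
      orders.apply_discount(order, 0.1)
      spy.assert_called_once()

  # The conceptual test a human might write: assert observable behaviour.
  def test_discount_reduces_total():
      order = Order(items=[Item(price=100)])
      orders.apply_discount(order, 0.1)
      assert order.total == 90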

That being said, I've come across loads of human written tests that are very similar, so I can see where the agents are coming from.

You often mention that this is why you are getting good results from LLMs, so it would be great if you could expand on how you do this at some point in the future.

simonw 10 hours ago

I work in Python which helps a lot because there are a TON of good examples of pytest tests floating around in the training data, including things like usage of fixture libraries for mocking external HTTP APIs and snapshot testing and other neat patterns.

Or I can say "use pytest-httpx to mock the endpoints" and Claude knows what I mean.
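
For example, something like this (the endpoint and payload are made up, but httpx_mock is the real fixture pytest-httpx provides):

  import httpx

  def test_fetch_user(httpx_mock):
      # Register a canned response for the endpoint under test.
      httpx_mock.add_response(
          url="https://api.example.com/users/1",
          json={"id": 1, "name": "Alice"},
      )
      response = httpx.get("https://api.example.com/users/1")
      assert response.json() == {"id": 1, "name": "Alice"}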

Keeping an eye on the tests is important. The most common anti-pattern I see is large amounts of duplicated test setup code - which isn't a huge deal, I'm much more tolerant of duplicated logic in tests than I am in implementation, but it's still worth pushing back on.

"Refactor those tests to use pytest.mark.parametrize" and "extract the common setup into a pytest fixture" work really well there.

Generally though the best way to get good tests out of a coding agent is to make sure it's working in a project with an existing test suite that uses good patterns. Coding agents pick the existing patterns up without needing any extra prompting at all.

I find that once a project has clean basic tests the new tests added by the agents tend to match them in quality. It's similar to how working on large projects with a team of other developers works - keeping the code clean means that when people look for examples of how to write a test they'll be pointed in the right direction.

One last tip I use a lot is this:

  Clone datasette/datasette-enrichments
  from GitHub to /tmp and imitate the
  testing patterns it uses
I do this all the time with different existing projects I've written - the quickest way to show an agent how you like something to be done is to have it look at an example.
disgruntledphd2 9 hours ago

> Generally though the best way to get good tests out of a coding agent is to make sure it's working in a project with an existing test suite that uses good patterns. Coding agents pick the existing patterns up without needing any extra prompting at all.

Yeah, this is where I too have seen better results. The worst ones have been in places where it was greenfield and I didn't have a great idea of how to write tests (I'm a data person working on a Django app).

Thanks for the information, that's super helpful!

thunspa 8 hours ago

I work in Python as well and find Claude quite poor at writing proper tests, though I might be using it wrong. Just last week, I asked Opus to create a small integration test (with pre-existing examples) and it tried to create a 200-line file with 20 tests I didn't ask for.

I am not sure why, but it kept trying to do that despite my several attempts to steer it away.

Ended up writing it on my own, very odd. This was in Cursor, however.

jihadjihad 9 hours ago

In my experience asking the model to construct an automated test suite, with no additional context, is asking for a bad time. You'll see tests for a custom exception class that you (or the LLM) wrote that check that the message argument can be overwritten by the caller, or that a class responds to a certain method, or some other pointless and/or tautological test.

If you start with an example file of tests that follow a pattern you like, along with the code the tests are for, it's pretty good at following along. Even adding a sentence to the prompt about avoiding tautological tests and focusing on the seams of functions/objects/whatever (integration tests) can get you pretty far toward a solid test suite.
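
Roughly the difference between what you get by default and what that one extra sentence buys you (all names here are invented; the second test assumes a hypothetical checkout function and gateway test double):

  class PaymentError(Exception):
      pass

  # Tautological: only proves that Exception works, which we already know.
  def test_message_can_be_overridden():
      assert str(PaymentError("custom message")) == "custom message"

  # Seam-focused: exercises the boundary between two components instead.
  def test_declined_card_marks_order_failed():
      gateway = FakeGateway(decline_next_charge=True)  # hypothetical double
      result = checkout(order={"total": 10}, gateway=gateway)
      assert result.status == "failed"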

kaydub 7 hours ago

One agent writes the tests; that first pass just threads the needle.

Another agent reviews the tests, finds duplicate code, finds poor testing patterns, looks for tests that are only following the "happy path", ensures logic is actually tested and that you're not wasting time testing things like getters and setters. That agent writes up a report.

Give that report back to the agent that wrote the tests, or spin up a new agent and feed the report to it.

Don't do all of this blindly; actually read the report to make sure the LLM is on the right path. Repeat that one or two times.

matltc 6 hours ago

Yeah, I've seen this too. It bangs out a five-hundred-line unit test file, but half of the tests are as you describe.

Just writing one line in CLAUDE.md or similar saying "don't test library code; assume it is covered" works.
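
E.g. a short testing section along these lines (the exact wording is just my illustration):

  ## Testing
  - Don't test library code; assume it is covered.
  - Don't test getters, setters, or trivial constructors.
  - Prefer a few behaviour-level tests over many implementation-level ones.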

Half the battle with this stuff is realizing that these agents are VERY literal. The other half is paring down your spec/token usage without sacrificing clarity.

kaydub 7 hours ago

Once the agent writes your tests, have another agent review them and ask that agent to look for pointless tests, to make sure testing covers more than just the "happy path", etc.

Just like anything else in software, you have to iterate. The first pass is just to thread the needle.

wvenable 7 hours ago

> I dunno, maybe I have high standards

I don't get it. I have insanely high standards so I don't let the LLM get away with not meeting my standards. Simple.

archagon 4 hours ago

I get the sense that many programmers resent writing tests and see them as a checkbox item or even boilerplate, not a core part of their codebase. Writing great tests takes a lot of thought about the myriad bizarre and interesting ways your code will run. I can’t imagine that prompting an LLM to “write tests for this code” will result in anything but the most trivial of smoke test suites.

Incidentally, I wonder if anyone has used LLMs to generate complex test scenarios described in prose, e.g. “write a test where thread 1 calls foo, then before hitting block X, thread 2 calls bar, then foo returns, then bar returns” or "write a test where the first network call Framework.foo makes returns response X, but the second call returns error Y, and ensure the daemon runs the appropriate mitigation code and clears/updates database state." How would they perform in this scenario? Would they add the appropriate shims, semaphores, test injection points, etc.?
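
For the second scenario, the kind of test I'd hope to get back looks roughly like this sketch (Framework.foo is from the prose above; the daemon entry point and db fixture are invented):

  from unittest.mock import patch

  def test_daemon_mitigates_second_call_failure(db):  # hypothetical db fixture
      # side_effect sequences the calls: the first returns X, the second raises Y.
      outcomes = [{"status": "ok"}, TimeoutError("second call fails")]
      with patch.object(Framework, "foo", side_effect=outcomes):
          run_daemon_cycle()  # hypothetical entry point under test
      assert db.order_state == "mitigated"

The threading scenario is the harder one: it needs injection points or semaphores threaded through the code under test, and I'd be curious whether a model would add those unprompted.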

touristtam 9 hours ago

Embrace TDD? Write those tests and tell the agent to write the subject under test?

0xffff2 6 hours ago

Different strokes for different folks and all, but that sounds like automating all of the fun parts and doing all of the drudgery by hand. If the LLM is going to write anything, I'd much rather make it write the tests and do the implementation myself.

yakshaving_jgt 3 hours ago

This is a serious problem with professional software development — programmers see testing as a chore, and self-indulge in the implementation.

asadjb 8 hours ago

Unfortunately, I have started to feel that using AI to code, even with a well-designed spec, ends up with code that, in the author's words, looks like

> [Agents write] units of changes that look good in isolation.

I have only been using agents for coding end-to-end for a few months now, but I think I've started to realise why the output doesn't feel that great to me.

Like you said, "it's my job" to create a well-designed code base.

Without writing the code myself, however, without feeling the rough edges of the abstractions I've written, without getting a sense of how things should change to make the code better architected, I just don't know how to make it better.

I've always worked in smaller increments, creating the small piece I know I need and then building on top of that. That process highlights the rough edges, the inconsistent abstractions, and that leads to a better codebase.

AI (it seems) decides on a direction and then writes 100s of LOC at once. It doesn't need to build abstractions because it can write the same piece of code a thousand times without caring.

I write one function at a time, and as soon as I try to use it in a different context I realise a better abstraction. The AI just writes another function with 90% similar code.

WorldMaker 7 hours ago

The old classic mantra is "work smarter, not harder". LLMs are perfect for "work harder". They can produce bulk numbers of lines. They can help you brute force a problem space with more lines of code.

We expect the spec writing and prompt management to cover the "work smarter" bases, but part of the "work smarter" loop is hitting those points where "work harder" is about to happen - where you know you could solve a problem with 100s or 1000s of lines of code - pausing for a bit, and finding the smarter path/the shortcut/the better abstraction.

I've yet to see an "agentic loop" that works half as well as my well trained "work smarter loop" and my very human reaction to those points in time of "yeah, I simply don't want to work harder here and I don't think I need hundreds more lines of code to handle this thing, there has to be something smarter I can do".

In my opinion, the "best" PRs delete as much or more code than they add. In the cleanest LLM-created PRs I've never seen an LLM propose a true removal that wasn't just a "this code wasn't working according to the tests so I deleted the tests and the code" level mistake.

AstroBen 7 hours ago

There used to be a saying that "the best programmers are lazy" - I think the opposite is now true.

acessoproibido 7 hours ago

I don't see why you can't use your approach with AI: write one function at a time, make it work in the context, and then move on. Sure, you can't tell it to do all that in one step, but personally I really like not dealing with the boilerplate stuff and worrying more about the context and how to use my existing functions in different places.

pgwhalen 10 hours ago

> Going back to typing all of the code yourself (my interpretation of "writing by hand") because you don't have the agent-managerial skills to tell the coding agents how to clean up the mess they made feels short-sighted to me.

I increasingly feel a sort of "guilt" when going back and forth between agent-coding and writing it myself. When the agent didn't structure the code the way I wanted, or it just needs overall cleanup, my frustration will get the best of me and I will spend too much time writing code manually or refactoring using traditional tools (IntelliJ). It's clear to me that with current tooling some of this type of work is still necessary, but I'm trying to check myself about whether a certain task really requires my manual intervention, or whether the agent could manage it faster.

Knowing how to manage this back and forth reinforces a view I've seen you espouse: we have to practice and really understand agentic coding tools to get good at working with them, and it's a complete error to just complain and wait until they get "good enough" - they're already really good right now if you know how to manage them.

skerit 9 hours ago

The article said:

> So I’m back to writing by hand for most things. Amazingly, I’m faster, more accurate, more creative, more productive, and more efficient than AI, when you price everything in, and not just code tokens per hour

At least he said "most things". I also did "most things" by hand, until Opus 4.5 came out. Now it's doing things in hours that I would have spent an entire week on. But it's not a prompt-and-forget kind of thing; it needs hand-holding.

Also, I have no idea _what_ agent he was using. OpenAI, Gemini, Claude, something local? And with a subscription, or paying by the token?

Because the way I'm using it, this only pays off because it's the $200 Claude Max subscription. If I had to pay per token (which, once again, is hugely marked up), I would have gone bankrupt.

kaydub 7 hours ago

The article and video just feel like another dev poo-pooing LLMs.

"vibe coding" didn't really become real until 2025, so how were they vibe coding for 2 years? 2 years ago I couldn't count on an llm to output JSON consistently.

Overall the article/video are SUPER ambiguous and frankly worthless.

yojat661 6 hours ago

Cursor and GPT-4 have been a thing since 2023. So, no, vibe coding didn't become real just last year.

9rx 6 hours ago

I successfully vibe coded an app in 2023, soon after VS Code Copilot added the chat feature, although we obviously didn't call it that back then.

I remember being amazed and at the time thinking the game had changed. But I've never been able to replicate it since. Even the latest and greatest models seem to always go off and do something stupid that they can't figure out how to recover from without some serious handholding and critique.

LLMs are basically slot machines, though, so I suppose there has always been a chance of hitting the jackpot.

lunar_mycroft 4 hours ago

> That's your job.

No, it isn't. To quote your own blog, his job is to "deliver code [he's] proven to work", not to manage AI agents. The author has determined that managing AI agents is not an effective way to deliver code in the long term.

> you don't have the agent-managerial skills to tell the coding agents how to clean up the mess they made

The author has years of experience with AI-assisted coding. Is there any way to check whether someone is actually skilled at using these tools, other than self-reports or studies measuring whether they do better with the tools than without?

candiddevmike 9 hours ago

> Going back to typing all of the code yourself (my interpretation of "writing by hand") because you don't have the agent-managerial skills to tell the coding agents how to clean up the mess they made feels short-sighted to me.

Or those skills are a temporary side effect of the current SOTA and will be useless in the future, so honing them is pointless right now.

Agents shouldn't make messes in the first place, if they did what it says on the tin; and if folks are wasting considerable time cleaning up after them, they should've just written the code themselves.

ap99 10 hours ago

> That's your job.

Exactly.

AI assisted development isn't all or nothing.

We as a group and as individuals need to figure out the right blend of AI and human.

thesz 8 hours ago

> AI assisted development isn't all or nothing.
>
> We as a group and as individuals need to figure out the right blend of AI and human.
This is what makes the current LLM debate very much like the strong typing debate of about 15-20 years ago.

"We as a group need to figure out the right blend of strong static and weak dynamic typing."

One can look around and see where that old discussion brought us. In my opinion, nowhere; things are the same as they were.

So, where will LLM-assisted coding bring us? Rhyming it with the static typing debate, I see no variant other than "nowhere."

dwaltrip 2 hours ago

As a former “types are overrated” person, TypeScript was my conversion moment.

For small projects, I don’t think it makes a huge difference.

But for large projects, I’d guess that most die-hard dynamic-typing people who have tried TypeScript have now seen the light and find lots of benefits in static typing.

thesz an hour ago

I was on the other side; I thought types were indispensable. And I still do.

My own experience suggests that if you need to develop a heavily multithreaded application, you should use Haskell: MVars are enough if you are working alone, and you need software transactional memory (STM) if you are working as part of a team of two or more people.

STM makes stitching different parts of a parallel program together as easy as writing a sequential program - the sequential coordination is delegated to STM. But STM needs control of side effects: one should not write a file inside an STM transaction, only before the transaction starts or after it finishes.

Because of this, C#, F#, C++, C, Rust, Java and most other programming languages do not have a proper STM implementation.

For controlling (and combining) (side) effects one needs higher-order types and partially instantiated types. These had already been available in Haskell (GHC 6.4, 2005) for four years by the time Rust was conceived (2009).

Did Rust do anything to have these? No. The authors were a little too concerned with reimplementing what Henry Baker did at the beginning of the 1990s, if not before that.

Do the Rust authors have plans to implement these? No, they have other things to do urgently to serve the community better. As if complex coordination of heavily parallel programs is not a priority at all.

This is where I get my "rhyme" from.

freedomben 9 hours ago

Seriously. I've known for a very long time that our community has a serious problem with binary thinking, but AI has done more to reinforce that than anything else in recent memory. Nearly every discussion I get into about AI is dead out of the gate because at least one person in the conversation has a binary view that it's either handwritten or vibe coded. They have an insanely difficult time imagining anything in the middle.

Vibe coding is the extreme end of using AI, while handwriting is the extreme end of not using AI. The optimal spot is somewhere in the middle. Where exactly that spot is, I think, is still up for debate. But the debate is not advanced in any way by latching on to the extremes and assuming that they are the only options.

kaydub 7 hours ago

The "vibe coding" term is causing a lot of brain rot.

When I see people downplaying LLMs, or describing their poor experiences, it feels like they're trying to "vibe code" but expect the LLM to automatically do EVERYTHING. They take it as a failure that they have to tell the LLM explicitly to do something a couple of times. Or they take it as a problem that the LLM didn't "one shot" something.

bandrami 7 hours ago

I'd like it to take less time to correct than it takes me to type out the code I want and as of yet I haven't had that experience. Now, I don't do Python or JS, which I understand the LLMs are better at, but there's a whole lot of programming that isn't in Python or JS...

kaydub 7 hours ago

I've had success across quite a few languages, more than just Python and JS. I find it insanely hard to believe you can write code faster than the LLM, even if the LLM has to iterate a couple of times.

But I'm thankful for you devs that are giving me job security.

bandrami 36 minutes ago

And that tells me you're on the dev end of the devops spectrum while I'm fully on the ops side. I write very small pieces of software (the time it takes to type them is never the bottleneck) that integrate in-house software with whatever services it has to actually interact with, which every LLM I've used gets wrong the first fifteen or so times it tries (for some reason rtkit in particular absolutely flummoxes every single LLM I've ever given it to).

anonymars 8 hours ago

I think you will find this is not specific to this community or to AI, but true of any topic involving nuance and trade-offs without a right answer.

For example, most political flamefests

kaydub 7 hours ago

I'm only writing 5-10% of my own code at this point. The AI tools are good; it just seems like people that don't like them expect them to be 100% automatic with no hand-holding.

Like people in here complaining about how poor the tests are... but did they start another agent to review the tests? Did they take that and iterate on the tests with multiple agents?

I can attest that the first pass of testing can often be shit. That's why you iterate.

Ososjjss 7 hours ago

> I can attest that the first pass of testing can often be shit. That's why you iterate.

So far, by the time I’m done iterating, I could have just written it myself. Typing takes like no time at all in aggregate. Especially with AI assisted autocomplete. I spend far more time reading and thinking (which I have to do to write a good spec for the AI anyways).

kaydub 7 hours ago

Nope, you couldn't have written it yourself in the same time. That's just a false assumption a lot of you like to make.

dionian 8 hours ago

I agree. As a pretty experienced coder, I wonder if the newer generation is just rolling with the first shot. I find myself having the AI rewrite things a slightly different way 2-3x per feature, or maybe even 10x, because I know quality when I see it, having done so much by hand and so much reading.
