Infinitely agree with all of this. I was skeptical, then tried Opus 4.5 and was blown away. Codex with 5.0 and 5.1 wasn't great, but 5.2 is a big improvement. At this point I can't write code without it, because there's no point: given the same time and the right constraints, you're going to get better code. The same goes for procrastination, both from not knowing where to start and from getting stuck in the middle with no idea where to go next. That literally never happens anymore. You discuss the planning and the different implementation options with it, and by the end you have a good design description. At that point, why write the code yourself when, given that design, it will write it quickly and match what was agreed?

jackschultz (19 hours ago):

Sure, but the end of this post [0] is where I'm at. I don't feel the need or want to write the code when I can spend my time doing the other parts, which are much more interesting and valuable.

> Emil concluded his article like this:

> JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn’t have written it this quickly without the agent.

> But “quickly” doesn’t mean “without thinking.” I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.

> That’s probably the right division of labor.

> I couldn’t agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what’s left to be a much more valuable use of my time.

[0] https://simonwillison.net/2025/Dec/14/justhtml/

culopatin (18 hours ago):

But are those tests relevant? I tried using LLMs to write tests at work, and whenever I review them I end up asking it, “OK great, it passes the test, but is the test relevant? Does it test anything useful?” And I get back an “Oh yeah, you’re right, this test is pointless.”

manmal (17 hours ago):

Keep track of test coverage and ask it to delete tests without lowering coverage by more than, let’s say, 0.01 percentage points. If you have a script that gives it only the test coverage, plus a file listing all tests with their line number ranges, it is more or less a dumb task it can work on for hours without actually reading the files (which would fill the context too quickly).

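A minimal sketch of that kind of summary script, assuming pytest-style tests under `tests/` and a `coverage.json` produced by coverage.py's `coverage json` command (the paths and naming here are illustrative, not from the thread):

```python
# Hypothetical helper: print total coverage plus each test function's file,
# line range, and name, so the agent can plan edits without opening the files.
import ast
import json
from pathlib import Path


def main() -> None:
    # Assumes `coverage run -m pytest` followed by `coverage json` was run first.
    totals = json.loads(Path("coverage.json").read_text())["totals"]
    print(f"total coverage: {totals['percent_covered']:.2f}%")

    # Assumes pytest-style tests named test_*.py under tests/.
    for path in sorted(Path("tests").rglob("test_*.py")):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
                print(f"{path}:{node.lineno}-{node.end_lineno}  {node.name}")


if __name__ == "__main__":
    main()
```

The agent only ever sees this summary; after each proposed change you re-run coverage and reject it if the total drops by more than the agreed threshold.
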
gaigalas (16 hours ago):

That does not work as advertised. If you leave an agent alone for hours trying to increase coverage by percentage, without further guiding instructions, you will end up with lots of garbage.

To achieve this, you need several distinct loops: one that creates tests (there will be garbage), one that consolidates redundant tests, one that parametrizes repetitive tests, and so on. Agents create redundant tests for all sorts of reasons. Maybe they're chasing a hard-to-reach line and leave several attempts behind. Or maybe they "get creative" and guess at what is uncovered instead of actually following the coverage report, etc.

Less capable models are actually better at this. They're faster, they don't "get creative" with weird ideas mid-task, and they cost less. Just make them work one test at a time: spawn, write one test that verifiably increases overall coverage, exit. Once you reach a threshold, start the consolidating loop: pick a redundant pair of tests, consolidate, exit. And so on.

Of course, you can use a powerful model and babysit it as well; a few disambiguating questions and interruptions will guide it well. If you want truly unattended operation, though, it's damn hard to get stable results.

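A minimal sketch of that "one test per spawn" outer loop, where `spawn_agent()` is a hypothetical placeholder for however you invoke your agent, coverage is measured with coverage.py plus pytest, and git is used to keep or discard each attempt:

```python
# Hypothetical orchestration of the "spawn, write one covering test, exit" loop.
import json
import subprocess
from pathlib import Path

COVERAGE_TARGET = 90.0  # assumed threshold before switching to the consolidation loop


def measure_coverage() -> float:
    subprocess.run(["coverage", "run", "-m", "pytest", "-q"])  # tests may fail; don't abort
    subprocess.run(["coverage", "json"], check=True)           # writes coverage.json
    return json.loads(Path("coverage.json").read_text())["totals"]["percent_covered"]


def spawn_agent(prompt: str) -> None:
    raise NotImplementedError("placeholder: call your agent CLI or API here")


def main() -> None:
    current = measure_coverage()
    while current < COVERAGE_TARGET:
        spawn_agent(
            "Write exactly ONE new test that verifiably increases overall "
            "coverage. Follow the coverage report; do not guess. Then exit."
        )
        new = measure_coverage()
        if new > current:
            # Keep the attempt, including any newly created test files.
            subprocess.run(["git", "add", "-A"], check=True)
            subprocess.run(["git", "commit", "-m", "add one covering test"], check=True)
            current = new
        else:
            # Discard the attempt: revert tracked changes and drop untracked files.
            subprocess.run(["git", "checkout", "--", "."], check=True)
            subprocess.run(["git", "clean", "-fd"], check=True)


if __name__ == "__main__":
    main()
```

The consolidation loop would have the same shape, with a different single-step prompt and "coverage did not drop" as the acceptance check.
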
tlarkworthy (10 hours ago):

We fixed this at work by instructing it to maximize coverage with minimal tests, which is closer to our coding style.

elbear (9 hours ago):

Those tests were written by people. That's why they were confident that what the LLM implemented was correct.

jackschultz (2 hours ago):

This is meta about how important context is. People see "LLMs" and "tons of tests" in the same sentence and think it shows how models love writing pointless tests, rather than realizing that the tests are standard, human-written ones that show the model wrote code validated by an already-trusted source. It shows that writing comments that give human readers the right context is _very_ similar to how we need to interact with LLMs. And if we fail to communicate with humans, clearly we're going to fail with models.

wahnfrieden (18 hours ago):

Yes. Skill issue... and perhaps the wrong model + harness.

scottyah (15 hours ago):

It's the semantics of "can", where it is used to suggest feasibility. When I moved and got a new commute, I still "could" bike to work, but it went from 30 minutes to an hour and a half each way. While technically possible, I would have had to sacrifice a lot by losing two hours a day: laundry, cooking dinner, downtime. I always said I "can't really" bike to work, but a lot of context is lost in that.

zamadatix (17 hours ago):

"Can" is too overloaded a word even with context provided, ranging from senses like "could conceivably be achieved" to "usually possible". The only hint you can dig out is what limits on feasibility they might have in mind. E.g. "I can fly first class all the time (if I limit the number of flights and spend an unreasonable portion of my wealth on tickets)" is typically a less useful interpretation than "I can fly first class all the time (frequently and without concern, because I'm very well off)", but you have to figure out which one they are trying to say (which isn't always easy).

wahnfrieden (18 hours ago):

I can't without seriously sacrificing productivity. (I've been coding for 30 years.)
