| ▲ | dakiol 13 hours ago |
| Honest question: if you're using multiple agents, it's usually not to produce a dozen lines of code. It's to produce a big enough feature spanning multiple files, modules and entry points, with tests and all. So far so good. But once that feature is written by the agents... wouldn't you review it? Like reading line by line what's going on and detecting if something is off? And wouldn't that part, the manual reviewing, take an enormous amount of time compared to the time it took the agents to produce it? (you know, it's more difficult to read other people's/machine code than to write it yourself)... meaning all the productivity gained is thrown out the door. Unless you don't review every generated line manually, and instead rely on, let's say, UI e2e testing, or perhaps unit testing (that the agents also wrote). I don't know, perhaps we are past the phase of "double check what agents write" and are now in the phase of "ship it. if it breaks, let agents fix it, no manual debugging needed!"? |
|
| ▲ | Leynos 12 hours ago | parent | next [-] |
| Here's what I suggest: Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, and a test and documentation plan. Enforce single responsibility, CQRS, domain segregation, etc. Make the code as easy for you to reason about as possible. Enforce domain naming and function / variable naming conventions to make the code as easy to talk about as possible. Use code review bots (Sourcery, CodeRabbit, and CodeScene). They catch the small things (violations of contract, antipatterns, etc.) and the large (UX concerns, architectural flaws, etc.). Go all in on linting. Make the rules as strict as possible, and tell the review bots to call out rule subversions. Write your own lints for the things the review bots are complaining about regularly that aren't caught by existing lints. Use BDD alongside unit tests; read the .feature files before the build and give feedback. Use property testing as part of your normal testing strategy. Snapshot testing, e2e testing with MITM proxies, etc. For functions of any non-trivial complexity, consider bounded or unbounded proofs, model checking, or undefined behaviour testing. I'm looking into mutation testing and fuzzing too, but I am still learning. Pause for frequent code audits. Ask an agent to audit for code duplication, redundancy, poor assumptions, architectural or domain violations, and TOCTOU violations. Give yourself maintenance sprints where you pay down debt before resuming new features. The beauty of agentic coding is, suddenly you have time for all of this. |
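To make the "write your own lints" step concrete, here is a minimal sketch of a project-specific lint built on Python's standard `ast` module. The rule chosen (flagging bare `except:` handlers) and all names are illustrative assumptions, not output from any of the bots mentioned above:

```python
import ast

# Hypothetical project-specific lint: once a review bot has flagged the
# same pattern a few times, encode it as a cheap AST check in CI.
def find_bare_excepts(source: str) -> list[int]:
    """Return the line numbers of bare `except:` handlers in `source`."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        # an ExceptHandler with no exception type is a bare `except:`
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]

sample = "try:\n    risky()\nexcept:\n    pass\n"
print(find_bare_excepts(sample))  # line numbers of offending handlers
```

The same skeleton extends to naming conventions or domain-boundary rules: walk the tree, match the pattern, report the line.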
| |
| ▲ | sigotirandolas an hour ago | parent | next [-] | | To be devil's advocate: Many of those tools are overpowered unless you have a very complex project that many people depend on. The AI tools will catch the most obvious issues, but will not help you with the most important aspects (e.g. whether your project is useful, or the UX is good). In fact, having this complexity from the start may kneecap you (the "code is a liability" cliché). You may be "shipping a lot of PRs" and "implementing solid engineering practices", but how do you know that is getting you closer to what you value? How do you know that this is not actually slowing you down? | |
| ▲ | dominotw 12 hours ago | parent | prev | next [-] | | > Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan. I feel like I am a bit stupid to not be able to do this. My process is more iterative. I start working on a feature, then I discover some other function that's slightly related, go refactor it into common code, then proceed with the original task. Sometimes I stop midway and see if this can be done with a library somewhere and go look at examples. I take many detours like these. I am never working on a single task like a robot, and I don't want Claude to work like that either. That seems so opposite of how my brain works. What am I missing? | | |
| ▲ | Leynos 10 hours ago | parent | next [-] | | Again, here's what works for me. When I get an idea for something I want to build, I will usually spend time talking to ChatGPT about it. I'll request deep research on existing implementations, relevant technologies and algorithms, and a survey of literature. I find NotebookLM helps a lot at this point, as does Elevenreader (I tend to listen to these reports while walking or doing the dishes or what have you). I feed all of those into ChatGPT Deep Research along with my own thoughts about the direction of the system, and ask it to produce a design document. That gets me something like this: https://github.com/leynos/spycatcher-harness/blob/main/docs/... If I need further revisions, I'll ask Codex or Claude Code to do those. Finally, I break that down into a roadmap of phases, steps and achievable tasks using a prompt that defines what I want from each of those. That gets me this: https://github.com/leynos/spycatcher-harness/blob/main/docs/... Then I use an adapted version of OpenAI's execplans recipe to plan out each task (https://github.com/leynos/agent-helper-scripts/blob/main/ski...). The task plans end up looking like this: https://github.com/leynos/spycatcher-harness/blob/main/docs/... At the moment, I use Opus or GPT-5.4 on high to generate those plans, and Sonnet or GPT-5.4 medium to implement. The roadmap and the design are definitely not set in stone. Each step is a learning opportunity, and I'll often change the direction of the project based on what I learn during the planning and implementation. And of course, this is just what works for me. The fun of the last few months has been everyone finding out what works for them. | |
| ▲ | hirvi74 10 hours ago | parent | prev [-] | | You seem to work a lot like how I do. If that is being stupid, then well, count me in too. To be honest, if I had to go through all the work of planning, scope, escalation criteria, etc., then I would probably be better off just writing the damn code myself at that point. |
| |
| ▲ | bmurphy1976 8 hours ago | parent | prev [-] | | Can't upvote you enough. This is the way. You aren't vibe coding slop; you have built an engineering process that works even if the tools aren't always reliable. This is the same way you build out a functioning and highly effective team of humans. The only obvious bit you didn't cover was extensive documentation, including historical records of various investigations, debug sessions, and technical decisions. |
|
|
| ▲ | Salgat 13 hours ago | parent | prev | next [-] |
| This is the biggest bottleneck for me. What's worse is that LLMs have a bad habit of being very verbose and rewriting things that don't need to be touched, so the surface area for change is much larger. |
| |
| ▲ | mohsen1 an hour ago | parent | next [-] | | I highly recommend adding `/simplify` to your workflow. It walks back over-engineerings quite often for me. | |
| ▲ | sheept 8 hours ago | parent | prev | next [-] | | Not only that, but LLMs do a disservice to themselves by writing long-winded code and decorating lines with redundant comments, which wastes their context the next time they work with it. | |
| ▲ | cyanydeez 12 hours ago | parent | prev [-] | | It's kind of weird; I jumped on the vibe coding opencode bandwagon but using local 395+ w/128; qwen coder. Now, it takes a bit to get the first tokens flowing, and the cache works well enough to get it going, but it's not fast enough to just set it and forget it, and it's clear when it goes in an absurd direction and either deviates from my intention or simply loads some context where it should have followed a pattern, whatever. I'm sure these larger models are both faster and more cogent, but it's also clear that what matters is managing its side tracks and cutting them short. Then I started seeing the deeper problematic pattern. Agents aren't there to multiply production; their real purpose is to shorten context to manageable levels. In effect, they're basically trying to reduce the odds of longer-context poisoning. So, if we boil down the probability of any given token triggering the wrong subcontext, it's clear that the greater the context, the greater the odds of a poison substitution. And that's really the problematic issue every model is going to contend with, because there's zero reality in which a single model is good enough. So now you're onto agents, breaking a problem into more manageable subcontexts and trying to fold the results back into the larger context gracefully, etc. Then that fails, because there's zero consistent determinism, so you end up at the harness, trying to herd the cats. This is all before you realize that these businesses can't just keep throwing GPUs at everything, because the problem isn't compute-bound; it's contextual/DAG-limited the same way a brain is limited. We all have intelligence and use several orders of magnitude less energy, doing mostly the same thing. |
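The context-poisoning argument above can be sketched as a back-of-the-envelope calculation. Assuming (purely for illustration) that each token independently has a small probability p of steering the model into the wrong subcontext, the chance that a context of n tokens stays "clean" decays geometrically with n:

```python
# Illustrative numbers only: p is an assumed per-token poisoning rate,
# not a measured value for any model.
def p_clean_context(p_poison_per_token: float, n_tokens: int) -> float:
    """Probability that no token in the context triggers a wrong subcontext."""
    return (1 - p_poison_per_token) ** n_tokens

for n in (1_000, 10_000, 100_000):
    print(n, round(p_clean_context(1e-5, n), 3))
```

Even a tiny per-token rate compounds: at p = 1e-5 a 100k-token context is clean only about a third of the time, which is one way to motivate splitting work across agents with shorter contexts.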
|
|
| ▲ | jwilliams 4 hours ago | parent | prev | next [-] |
| It’s a blend. There are plenty of changes in a production system that don’t necessarily need human review. Adding a help link. Fixing a typo. Maybe upgrades with strong CI/CD, or simple UI improvements, or safe experiments. There are features you can ship safely behind feature flags or staged releases. As you push in, you find that with the right tooling it can be a lot. If you break it down, often quite a bit can be deployed safely with minimal human intervention (depends naturally on the domain, but for a lot of systems). I’m aiming to revamp the whole process - I wrote a little on it here: https://jonathannen.com/building-towards-100-prs-a-day/ |
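A minimal sketch of the staged-release idea above, with a hypothetical flag name and user ID scheme: hash each user into a stable bucket so an agent-written feature can be rolled out to a deterministic percentage of traffic, limiting blast radius before full release.

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """Deterministically enable `flag` for `rollout_pct`% of users."""
    # Stable hash of (flag, user) -> bucket in 0..65535; the same user
    # always lands in the same bucket for a given flag.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = digest[0] * 256 + digest[1]
    return bucket < 65536 * rollout_pct // 100

print(flag_enabled("new-help-link", "user-42", 100))  # True at 100% rollout
print(flag_enabled("new-help-link", "user-42", 0))    # False at 0% rollout
```

Real systems layer kill switches and per-cohort targeting on top, but the deterministic-bucket core is roughly this simple.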
|
| ▲ | browningstreet 12 hours ago | parent | prev | next [-] |
| I use coding agents to produce a lot of code that I don’t ship. But I do ship the output of the code. |
|
| ▲ | keeda 11 hours ago | parent | prev | next [-] |
| > you know, it's more difficult to read other people's/machine code than to write it yourself Not at all, it's just a skill that gets easier with practice. Generally if you're in the position to review a lot of PRs, you get proficient at it pretty quickly. It's even easier when you know the context of what the code is trying to do, which is almost always the case when e.g. reviewing your team-mates' PRs or the code you asked the AI to write. As I've said before (e.g. https://news.ycombinator.com/item?id=47401494), I find reviewing AI-generated code very lightweight because I tend to decompose tasks to a level where I know what the code should look like, and so the rare issues that crop up quickly stand out. I also rely on comprehensive tests and I review the test cases more closely than the code. That is still a huge amount of time savings, especially as the scope of tasks has gone from functions to entire modules. That said, I'm not slinging multiple agents at a time, so my throughput with AI is way higher than without AI, but not nearly as much as some credible reports I've heard. I'm not sure they personally review the code (e.g. they have agents review it?) but they do have strategies for correctness. |
| |
| ▲ | nprateem 4 hours ago | parent [-] | | I'll often run 4 or 5 agents in parallel. I review all the code. Some agents will be developing plans for the next feature, but there can sometimes be up to 4 coding. These are typically a mix between trivial bug fixes and 2 larger but non-overlapping features. For very deep refactoring I'll only have a single agent run. Code reviews are generally simple since nothing of any significance is done without a plan. First I run the new code to see if it works. Then I glance at diffs and can quickly ignore the trivial var/class renames, new class attributes, etc., leaving me to focus on new significant code. If I'm reviewing feature A I'll ignore feature B code at this point. Merge what I can of feature A, then repeat for feature B, etc. This is all backed by a test suite I spot check, and linters for e.g. required security classes. Periodically we'll review the codebase for vulnerabilities (e.g. incorrectly scoped DB queries) and redundant/cheating tests. But the keys to multiple concurrent agents are plans where you're in control ("use the existing mixin", "nonsense, do it like this", etc.) and non-overlapping tasks. This makes reviewing PRs feasible. |
|
|
| ▲ | MattGaiser 12 hours ago | parent | prev [-] |
| Yep. In many cases I am just reviewing test cases it generated now. > if it breaks, let agents fix it, no manual debugging needed!" ? Pretty trivial now to have every Sentry issue get an immediate first pass by AI to attempt to solve the bug. |
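A hedged sketch of what such a first pass might look like: a helper that turns an incoming error report into a prompt for an agent. The field names here are assumptions for illustration, not Sentry's actual webhook schema, and the downstream agent invocation is left out:

```python
# Hypothetical glue for an "AI first pass on every error" pipeline.
# The `issue` dict shape is an assumed, simplified error payload.
def build_fix_prompt(issue: dict) -> str:
    """Build an agent prompt from a simplified error report."""
    return (
        f"Error: {issue.get('title', 'unknown')}\n"
        f"Stack trace:\n{issue.get('stacktrace', '')}\n"
        "Locate the bug, propose a minimal fix, and open a draft PR."
    )

issue = {"title": "KeyError: 'user_id'", "stacktrace": "app/views.py:42"}
print(build_fix_prompt(issue))
```

The human review then shifts from debugging to approving or rejecting the agent's draft PR, which matches the "review test cases, not every line" workflow above.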