▲ akhrail1996 2 hours ago
Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

The author uses different models for each role, which I get. But I run production agents on Opus daily, and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.
▲ arialdomartini 42 minutes ago
This is anecdotal, but just a couple of days ago some colleagues and I conducted a little experiment to gather that evidence. We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra, etc.) discuss a request and distill a solution. They all had access to the source code of the project to work on. Then we provided the very same input, including the persona definitions, straight to Claude Code, and compared the results.

The council of agents got to a very good result, consuming about $12, mostly using Opus 4.6. To our surprise, going straight to Claude Code with a single prompt got to a similarly good result, faster, consuming $0.30 and mostly using Haiku.

This surely deserves more investigation, but our hypothesis so far is that coordination and communication between agents carry a remarkable cost. Should this be the case, I personally would not be surprised:

- The reason we humans do job separation is that we have an inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge needed to be good architects, good business analysts, and good security experts all at once. Apparently, that's not a problem for an LLM. So job separation is probably not the necessary pattern for LLMs that it is for humans.

- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of its processes, to the point that processes turn into bureaucracy. In IT companies, many problems sit at the interface between groups, because of the low-bandwidth communication and inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself far better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.
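A rough back-of-envelope model makes the coordination-cost hypothesis concrete: in a council, every inter-agent exchange re-sends the shared context, so token usage grows with the number of communication edges rather than with the task. All numbers and function names below are made-up illustrative assumptions, not measurements from the experiment:

```python
def council_tokens(n_agents: int, context: int, rounds: int, reply: int) -> int:
    """Each round, every agent re-reads the shared context plus the other
    agents' replies, then produces its own reply."""
    total = 0
    for _ in range(rounds):
        per_agent_input = context + (n_agents - 1) * reply
        total += n_agents * (per_agent_input + reply)
    return total

def single_agent_tokens(context: int, rounds: int, reply: int) -> int:
    """A single agent just keeps iterating on its own growing transcript."""
    total = 0
    transcript = context
    for _ in range(rounds):
        total += transcript + reply
        transcript += reply
    return total

council = council_tokens(n_agents=5, context=20_000, rounds=3, reply=1_000)
single = single_agent_tokens(context=20_000, rounds=3, reply=1_000)
print(council, single)  # 375000 vs 66000 -- roughly a 5-6x gap
```

Under these toy assumptions the council burns several times the tokens for the same three rounds of refinement, which is at least consistent with the $12 vs $0.30 gap we saw.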
▲ kybernetikos an hour ago
There's a lot of cargo culting, but it's inevitable in a situation like this, where the truth is model-dependent and changing the whole time, and people have built companies on the premise that they can teach you how to use AI well.
▲ jaredklewis 2 hours ago
> what's the evidence

What's the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on. In my experience, evidence for the efficacy of software engineering practices falls into two categories:

- The intuitions of developers, based on their experiences.

- Scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations, or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well-defined task (like a coding puzzle) that is totally unlike actual software engineering.

Evidence for this LLM pattern is the same. Some developers have an intuition that it works better.
▲ Tarq0n 26 minutes ago
In machine learning, ensembles of weaker models can outperform a single strong model because they have different distributions of errors. Classical machine learning models tend to have more pronounced bias built into their design than LLMs do, though. So to me it makes sense to have models with different architectures/data/post-training refine each other's answers. I have no idea whether adding the personas would be expected to make a difference, though.
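The classic illustration of this is majority voting: if each member's errors were fully independent, the ensemble would beat any single member. A minimal sketch, assuming an idealized setup where each of n voters is independently correct with probability p (real model errors are partially correlated, which shrinks the gain):

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """P(a strict majority of n independent voters is correct),
    where each voter is correct with probability p."""
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(majority_accuracy(0.7, 1))  # single model: 0.7
print(majority_accuracy(0.7, 5))  # five-model vote: ~0.837
```

Whether persona prompts decorrelate errors the way separate architectures and training data do is exactly the open question.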
▲ totomz an hour ago
I think the splitting makes sense to give more specific prompts and isolated context to different agents. The "architect" does not need the code style guide in its context; that could actually be misleading and contain information that drives it away from the architecture.
▲ hakanderyal an hour ago
One added benefit is that it allows you to throw more tokens at the problem. It's the most impactful benefit, even. Context limits and how LLMs work require this. In my experience no frontier model produces bug-free, error-free code on the first pass, no matter how much planning you do beforehand.

With three tiers, you spend your token and context budget in full in three phases: plan, implement, review. If the feature is complex, multiple rounds of review, each from scratch. It works.
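The three-phase flow above can be sketched as a simple loop where each phase starts from a fresh context, so the full budget goes to one job at a time. `call_model` here is a hypothetical stand-in for whatever LLM API you use, and the prompts are illustrative, not anyone's actual setup:

```python
from typing import Callable

def phased_pipeline(task: str, call_model: Callable[[str, str], str],
                    max_review_rounds: int = 3) -> str:
    # Phase 1: plan in a clean context -- only the task, no code.
    plan = call_model("You are the planner. Produce an implementation plan.", task)
    # Phase 2: implement from the plan alone, again with a fresh context.
    code = call_model("You are the implementer. Write code for this plan.", plan)
    # Phase 3: review; each round restarts from scratch with plan + code only.
    for _ in range(max_review_rounds):
        verdict = call_model("You are the reviewer. Reply OK or list issues.",
                             plan + "\n" + code)
        if verdict.strip() == "OK":
            break
        code = call_model("You are the implementer. Fix these issues.",
                          code + "\n" + verdict)
    return code
```

The point is not the orchestration code, which is trivial, but that no single phase ever has to share its window with the others' transcripts.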
▲ est an hour ago
> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

There's a 63-page paper with a mathematical proof if you're really into this: https://arxiv.org/html/2601.03220v1

My takeaway: AI learns from real-world texts, and real-world corpora tend to have a role split of architect/developer/reviewer.
▲ palmotea an hour ago
> Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.
▲ jumploops an hour ago
After "fully vibecoding" (i.e. I don't read the code) a few projects, I've found the important aspect of this isn't so much the different agents, but the development process.

Ironically, it resembles waterfall much more than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From there you either iterate, or create a PR. Even with agile it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.

What's the evidence? Admittedly anecdotal, as I'm not aware of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done." "Done" is often subjective, and you can absolutely reach a done state just with vanilla Codex/Claude Code.

Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.
▲ awesome_dude an hour ago
I have been using different models for the same role: asking (say) Gemini, then, if I don't like the answer, asking Claude, then telling each LLM what the other one said to see where it all ends up.

Well, I was until the weekly session limit kicked in.
▲ troupo 2 hours ago
> produces better results than just... talking to one strong model in one session?

I think the author admits that it doesn't, doesn't realise it, and just goes on:

--- start quote ---

On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn't happened yet

--- end quote ---
▲ imiric 2 hours ago
Evidence? My friend, most of the practices in this field are promoted and adopted based on hand-waving, feelings, and anecdata from influencers. Maybe you should write and share your own article to counter this one.