I think the theoretical answer here is this:

"Agents address the problem from independent angles, other agents try to refute what they found, and the run keeps iterating until the answers converge."

So you supply the "ground truth" (test suite, detailed spec, whatever) and empower an agent to use it to guide the other agents. Currently a lot of people do this sequentially, in the form of multiple code-review passes by fresh agent sessions looking at the work of previous sessions.
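
A rough sketch of that propose / refute / converge loop, in case it helps make the idea concrete. Everything here is hypothetical: `run_until_converged`, the `Proposer`/`Critic`/`Oracle` shapes, and the toy agents are illustrative stand-ins for real agent sessions, and "converged" is assumed to mean that every proposer's answer passes the ground-truth check, survives every critic, and matches the others.

```python
from typing import Callable, Optional

# Hypothetical shapes: in a real run each of these would wrap an agent session.
Proposer = Callable[[list[str]], str]    # sees objections so far, returns a candidate
Critic = Callable[[str], Optional[str]]  # returns an objection, or None if it cannot refute
Oracle = Callable[[str], bool]           # ground truth: deterministic pass/fail


def run_until_converged(proposers: list[Proposer], critics: list[Critic],
                        oracle: Oracle, max_rounds: int = 5):
    """Return (answer, rounds_used), or (None, max_rounds) if no convergence."""
    objections: list[str] = []
    for round_no in range(1, max_rounds + 1):
        survivors = []
        for propose in proposers:
            candidate = propose(objections)
            # The oracle is checked first: consensus alone is not enough.
            if not oracle(candidate):
                objections.append(f"{candidate!r} fails the ground-truth check")
                continue
            raised = [o for critic in critics if (o := critic(candidate)) is not None]
            if raised:
                objections.extend(raised)
            else:
                survivors.append(candidate)
        # Converged: every proposer survived and they all agree.
        if len(survivors) == len(proposers) and len(set(survivors)) == 1:
            return survivors[0], round_no
    return None, max_rounds


# Toy agents (assumptions for the demo only).
def proposer_a(objections: list[str]) -> str:
    return "sort then dedupe"

def proposer_b(objections: list[str]) -> str:
    # Changes its answer once anything has been objected to.
    return "sort then dedupe" if objections else "dedupe only"

def passes_tests(candidate: str) -> bool:
    return "sort" in candidate

def silent_critic(candidate: str) -> Optional[str]:
    return None

result = run_until_converged([proposer_a, proposer_b], [silent_critic], passes_tests)
```

With these toy agents the second proposer fails the check in round one and falls in line in round two. The round cap matters: without it, agents that never agree would loop forever.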

Adversarial models are a longstanding technique in ML so it makes sense they would try to go this way.

▲

KronisLV 2 hours ago | parent | next [-]

> Currently a lot of people do this sequentially in the form of multiple code-review passes by fresh agent sessions looking at the work of previous sessions.

Up until now I've used a review loop approach: within a Claude Code session I just tell it to spawn three review sub-agents, each with context of what's going on and instructions to look over all of the changed code in search of serious/critical issues, but otherwise taking a fresh look at things. It works really well for the most part (token usage aside): https://news.ycombinator.com/item?id=48277011
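
For anyone who wants the shape of that fan-out outside of a chat session, here is a minimal sketch. It is not how Claude Code spawns sub-agents internally; `Finding`, `review_changes`, and the reviewer callables are hypothetical, and each reviewer is assumed to be a function that takes the diff text and returns a list of findings.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    severity: str  # assumed scale: "critical", "serious", or "minor"
    message: str


Reviewer = Callable[[str], list[Finding]]


def review_changes(diff: str, reviewers: list[Reviewer],
                   keep: tuple[str, ...] = ("critical", "serious")) -> list[Finding]:
    """Run every reviewer on the same diff and merge the important findings."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        # pool.map preserves reviewer order, so the merged output is stable.
        results = list(pool.map(lambda review: review(diff), reviewers))
    merged: list[Finding] = []
    seen: set[Finding] = set()
    for findings in results:
        for finding in findings:
            # Drop minor noise and exact duplicates reported by several reviewers.
            if finding.severity in keep and finding not in seen:
                seen.add(finding)
                merged.append(finding)
    return merged
```

Deduplicating on exact equality is a simplification; real reviewers will phrase the same issue differently, so some fuzzy merge step would likely be needed.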

▲

vadansky 4 hours ago | parent | prev | next [-]

I don't know, maybe I'm doing it wrong, but I feel LLMs add a slop debt, and each agent pass just exacerbates it.

Like, I had an LLM implement a spec and it said it was done... except it had a ton of `casts` everywhere. Okay, my bad, I should have been clear: "NO CASTS". So I used the LLM to remove the casts, except it just kept making things more and more complicated and ugly.

It took me taking a break and having a shower thought to realize all the ugliness is because one type should have been broken up into 2, which would remove a ton of generics and code. But Claude never suggested that, it was always "we need at least one cast here, or we need 1000 LOC of generic factories". I tried multiple new sessions with various prompts too.

Maybe one day soon LLMs could pay off their own slop debt but at least right now I don't trust them to write code unseen.

Edit: Maybe the correct action would have been to delete everything and make it rewrite everything from scratch with a clear "NO CASTS EVER" rule. But still, the point is that having an LLM clean up after an LLM doesn't feel like it works well enough to just keep it in a loop and never look at what it does.

	▲	zmj 2 minutes ago | parent | next [-]
		If you want hard rules, use deterministic tools. Prompts are for fuzzy guidance.
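
As a concrete instance of a "deterministic tool" for the no-casts rule: the thread doesn't say which language the casts were in, so this sketch assumes Python and treats any call to a function named `cast` (as in `typing.cast`) as banned. `find_casts` is a hypothetical helper, not part of any linter.

```python
import ast


def find_casts(source: str) -> list[int]:
    """Return the sorted line numbers of every call to something named `cast`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):          # cast(...)
            name = func.id
        elif isinstance(func, ast.Attribute):   # typing.cast(...)
            name = func.attr
        else:
            name = None
        # Assumption: matching on the bare name is good enough; this also
        # flags unrelated functions that happen to be called `cast`.
        if name == "cast":
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = """\
import typing
x = typing.cast(int, "3")
y = cast(str, x)
z = int("3")
"""
violations = find_casts(SAMPLE)
```

Wired into CI or a pre-commit hook, a non-empty result fails the build no matter how the prompt was worded.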
	▲	highwaylights 4 hours ago | parent | prev | next [-]
		This matches my experience. I've had to put a fair chunk of effort into skills that will run deterministic mechanisms to unslop a codebase (cyclomatic complexity grading has been really helpful here), as invariably some amount of guidance around principles will be missed over time. I've found it does help, though. Certainly I'm getting overall better results from Flash and Sonnet over multiple runs for fairly modest token increases. GPT 5.5 less so, but that's because it scores better in a first pass. I won't really know until I gauge it at the end of my subscription month which has been more cost efficient for me, all things considered.
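
A minimal sketch of what cyclomatic complexity grading can look like, assuming Python sources. This is not the commenter's actual skill, and it only approximates McCabe complexity: one point per function plus one per branch point. `complexity`, `grade`, and the threshold are hypothetical.

```python
import ast

# Node types counted as one extra path each (an approximation of McCabe's metric).
_BRANCHES = (ast.If, ast.IfExp, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler)


def complexity(func: ast.AST) -> int:
    score = 1  # a function with no branches has exactly one path
    for node in ast.walk(func):
        if isinstance(node, _BRANCHES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches
            score += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            score += 1 + len(node.ifs)
    return score


def grade(source: str, threshold: int = 10) -> dict[str, tuple[int, bool]]:
    """Map each function name to (score, passes). Nested defs also count toward their parent."""
    report = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = complexity(node)
            report[node.name] = (score, score <= threshold)
    return report


SAMPLE = """\
def simple(x):
    return x + 1

def branchy(x, y):
    if x and y:
        for i in range(x):
            if i % 2:
                y += i
    return y
"""
report = grade(SAMPLE, threshold=4)
```

The point of grading this way is that the score is the same on every run, so an agent can be told to keep going until every function is under the threshold.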
	▲	vinnymac 4 hours ago | parent | prev | next [-]
		The problem is that we have a large and ever-growing number of constraints, and not following even a single one means the result is sloppy. I don’t see them fixing this any time soon, and thus a human in the loop is a requirement to use these tools effectively. That is, unless you love your slot machine dopamine rush enough to ignore quality gates and respect for your peers' time.
	▲	tomjakubowski 3 hours ago | parent | prev | next [-]
		I've been reading and writing Rust for a long while now, since before 1.0. I'm capable of critically evaluating Rust code. I'm also a happy Claude Code user, mostly for lightweight uses like generating scaffolding, prototyping, and debugging. The pure-LLM, no-human-intervention, vibe-coded PRs on Bun since the vibe-rewrite to Rust contain the worst coding horrors I've seen in 20 years of programming. Setting aside the quality of the change itself (I would have done it differently, for sure: it is pretty straightforward to build a safe abstraction out of this type), the utterly pointless "source-text consistency test" added here is easily the worst example of "test repeats implementation" I have seen in my career: https://github.com/oven-sh/bun/pull/30728/files#diff-863477b...
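
For readers unfamiliar with the "test repeats implementation" smell, here is a generic illustration. It is not taken from the linked PR; `apply_discount` and both checks are made up, and the bug is deliberate.

```python
def apply_discount(price: float, percent: float) -> float:
    # Deliberate bug for illustration: this returns the discount amount,
    # not the discounted price.
    return price * percent / 100


def mirrors_implementation() -> bool:
    # Recomputes the expected value with the same formula as the code under
    # test, so it agrees with the implementation even when both are wrong.
    expected = 200 * 10 / 100
    return apply_discount(200, 10) == expected


def states_behavior() -> bool:
    # States the intended outcome independently: 10% off 200 should be 180.
    return apply_discount(200, 10) == 180
```

The first check passes against the buggy function and the second fails, which is the whole problem: a test that mirrors the code can only tell you the code equals itself.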

▲

Garlef an hour ago | parent | prev | next [-]

Doesn't help if the wrong design is implemented correctly.

▲

tsunamifury 5 hours ago | parent | prev [-]

Ground truth is not consensus; it has to be graded against what actually works for the original goal. Plenty of scenarios with AI and humans can result in consensus around incorrectness.

▲

adamtaylor_13 5 hours ago | parent [-]

While pedantically correct, I think the comment above assumed that you've correctly specified the work. If you can't correctly specify your work, AI agents are just going to help you get a non-solution faster.

▲

tsunamifury 5 hours ago | parent [-]

Isn't coding the act of specifying the work to a processor? And aren't AI agents supposed to bridge the gap with intelligence, from less specified to more specified, or possibly even to more intelligent, alternate implementations?

	▲	wrs an hour ago | parent | next [-]
		What I meant by "ground truth" is that it is not fuzzy, not AI-evaluated, and not a consensus. The test suite passes or it doesn't. The codebase lints or it doesn't. The performance improved or it didn't. An agent can help you create the specification, but it's up to you to know whether it's correctly testing that you got the result you wanted.
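
One way to picture that kind of ground truth is a gate that only looks at exit codes. This is a sketch under assumptions: `gate` is a hypothetical helper, and the commands in the usage comment (`pytest -q`, `ruff check .`) are just examples of what a project might run, not something from the thread.

```python
import subprocess
import sys
from typing import Sequence


def gate(checks: dict[str, Sequence[str]]) -> dict[str, bool]:
    """Run each command and record pass/fail from its exit code alone."""
    results = {}
    for name, command in checks.items():
        proc = subprocess.run(list(command), capture_output=True, text=True)
        # No model grades the output: exit code 0 passes, anything else fails.
        results[name] = proc.returncode == 0
    return results


# Example usage, assuming a project that uses these tools:
#   results = gate({"tests": ["pytest", "-q"], "lint": ["ruff", "check", "."]})
#   accept_change = all(results.values())
```

The gate says nothing about whether the checks themselves test the right thing; that part stays with whoever wrote the spec.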
	▲	adamtaylor_13 4 hours ago | parent | prev [-]
		Yep. And yet, there's still some level of specification you have to do.