We're working on a large Rust codebase, with development heavily assisted by Claude and Codex, and one critical workflow is this: after you have written a spec, have the other LLM critique it thoroughly.

This back and forth will take quite a while, but the resulting implementation plan will be 10x better than the original.

You can automate this by giving Codex a goal, and a skill to call Claude to review the implementation spec until they both agree it's done.
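A minimal sketch of that review loop, assuming hypothetical `review` and `revise` callables that wrap whatever CLI or API you actually use. The names, the `Review` type, and the single-reviewer approval convention are illustrative assumptions, not part of Codex, Claude, or any skill format. The stubs stand in for real model calls so the sketch runs offline:

```python
from typing import Callable, List, NamedTuple, Tuple


class Review(NamedTuple):
    approved: bool
    feedback: str


def refine_spec(
    spec: str,
    review: Callable[[str], Review],
    revise: Callable[[str, str], str],
    max_rounds: int = 5,
) -> Tuple[str, List[Review]]:
    """Alternate review and revision until the reviewer approves or rounds run out."""
    history: List[Review] = []
    for _ in range(max_rounds):
        result = review(spec)
        history.append(result)
        if result.approved:
            break
        # Hand the critique back to the authoring side for another pass.
        spec = revise(spec, result.feedback)
    return spec, history


# Hypothetical stand-ins for the two models.
def stub_review(spec: str) -> Review:
    if "error handling" in spec:
        return Review(True, "looks complete")
    return Review(False, "add a section on error handling")


def stub_revise(spec: str, feedback: str) -> str:
    return spec + "\n## error handling\n(addresses: " + feedback + ")"


final_spec, rounds = refine_spec(
    "# Spec\nparse the config file", stub_review, stub_revise
)
```

The `max_rounds` cap is there because two models are not guaranteed to converge; without it the loop could burn tokens indefinitely.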

Then, for critical code, have them both implement the spec in a worktree, then BOTH critique each other's implementation.

More often than not, Claude will say to take 2 or 3 pieces from its design over to Codex, but ship the Codex implementation.

▲

Aurornis 5 hours ago | parent | next [-]

I take this idea even further: After the LLMs have critiqued each other, I introduce a third critique and review it myself as a human. This third party review is most effective at highlighting problems that the LLMs miss, in my experience.

Jokes aside, I agree about having LLMs iterate. Bouncing between GPT and Opus is good in my experience, but even having the same LLM review its own output in a new session started fresh without context will surface a lot of problems.

This process takes a lot of tokens and a lot of time, which is fine because I’m reviewing and editing everything myself during that time.

▲

knivets 5 hours ago | parent | prev | next [-]

This is astrology for devs.

	▲	embedding-shape 2 hours ago | parent [-]
		Unless you can somehow provide some arguments against it, I feel like you're the one who is trying to cargo-cult stuff here. Say what you will with proper reasoning or arguments if you feel compelled; tired reddit-commentary like that helps no one.

▲

giancarlostoro 5 hours ago | parent | prev | next [-]

This is precisely how I used to use Beads before I made GuardRails (I wanted something slightly simpler, but similar, with more 'guard rails'). I braindump everything I want to build, then ask Claude to do market-level research. I then ask Claude to ask clarifying questions, then to be critical of its conclusions, provide the top options, and justify them. I also question Claude and say it's okay to disagree with me and be critical; I just want to understand.

By the end you have piecemeal "tickets" for your coding agent. If you have multiple developers you can sync them all up into GitHub, and someone could take some locally, or you can just have Claude work on all of them with subagents. The key feature there is that, because it's all piecemeal, the context stays per task.

Then I run a /loop 15m If you're currently working ignore this. Start on the next task in gur if you have not. If you finished all work and cannot pass one gate, work on the next available task.

(Note: gur is my shorthand for GuardRails)

I also added a concept called "gates": a task cannot complete without an attached gate. Gates are arbitrary and can be reused, but when assigned to a task those specific assignments are unique per task. A task is basically anything you want it to be: a unit test, trying to build it, or even seeking human confirmation. At least when I was using Beads it did not have "gates", but I'm not sure whether it has added anything like them since I stopped using it.

Claude will ignore the loop if it's currently working, and when it's "out of work" it will review all available tasks.

If anyone's curious, it's MIT licensed and on GitHub:
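To make the gate idea concrete, here is a rough sketch of one way such a model could look. This is not taken from the GuardRails source; the class names (`Gate`, `GateAssignment`, `Task`) and the assumption that a gate is a yes/no check are mine, purely for illustration:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Gate:
    """A reusable check definition, e.g. 'build succeeds' or 'human approved'."""
    name: str
    check: Callable[[], bool]


@dataclass
class GateAssignment:
    """One gate attached to one task; the pass/fail state lives here, per task."""
    gate: Gate
    passed: bool = False


@dataclass
class Task:
    title: str
    assignments: List[GateAssignment] = field(default_factory=list)
    done: bool = False

    def attach(self, gate: Gate) -> None:
        self.assignments.append(GateAssignment(gate))

    def try_complete(self) -> bool:
        # A task with no gate can never be completed.
        if not self.assignments:
            raise ValueError(f"task {self.title!r} has no gate attached")
        for assignment in self.assignments:
            assignment.passed = assignment.gate.check()
        self.done = all(a.passed for a in self.assignments)
        return self.done


build_ok = Gate("build succeeds", check=lambda: True)
human_ok = Gate("human confirmation", check=lambda: False)  # not yet confirmed

parser = Task("write parser")
parser.attach(build_ok)

cli = Task("write CLI")
cli.attach(build_ok)   # same gate definition, separate assignment
cli.attach(human_ok)
```

Keeping the pass/fail state on the assignment rather than on the gate is what lets one gate definition be shared across tasks without one task's result leaking into another.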

https://github.com/Giancarlos/guardrails

▲

motoboi 6 hours ago | parent | prev | next [-]

I strongly believe you don’t need to call another model for that. The same model can do it just fine, just not as part of the same context.

I mean that if you ask codex on gpt 5.5 to submit to a plan-reviewer subagent that uses gpt 5.5, this is enough to get a very good review and reassessment of the plan.

My hypothesis is that it’s even better than opus.

The reason submitting the product of one LLM to another for review works is that you need a fresh trajectory. The previous context might have “guided” the planner into some bias. Removing the context is enough to break free from that trajectory and start fresh.

▲

ai_fry_ur_brain 6 hours ago | parent | prev | next [-]

I hate how seriously people take the output of LLMs or how reliable they think it is.

Have Claude produce that spec 10 times, using the same prompt and same context. Identical requests, but you'll get 10 unique answers that will contradict each other, with each response seeming extremely confident.

It's scary how confident you people are in these outputs.

▲

CrazyStat 6 hours ago | parent | next [-]

If you ask 10 different humans to produce the spec with the same information (prompt and context) they will also produce 10 unique answers that will contradict each other and (depending on who you asked) may be just as confident.

There are real decisions to be made when going from a vague prompt to a spec. It's not surprising that an LLM would produce different specs for the same work on different runs. If the prompt already contained answers to all the decision points that come up when writing the spec then the prompt would already be the spec itself.

▲

dxxvi 7 minutes ago | parent | next [-]

> It's not surprising that an LLM would produce different specs for the same work on different runs

This is what I don't understand: AI is a computer program with its own data. If we give the same input to that computer program every time, why does it produce different outputs every time? Or does the input include LLM data + our prompt + some random data that computer program picks from its Internet search?
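One common source of run-to-run variation is that the next token is usually sampled from a probability distribution rather than always taken as the single most likely one. A toy sketch, assuming a made-up three-token vocabulary with hand-picked scores (real models score tens of thousands of tokens, and deployed systems can have other sources of nondeterminism as well):

```python
import math
import random
from typing import Dict


def sample_token(scores: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token: greedy at temperature 0, otherwise a softmax-weighted draw."""
    if temperature == 0:
        # Deterministic: always the highest-scoring token.
        return max(scores, key=scores.get)
    scaled = {tok: s / temperature for tok, s in scores.items()}
    top = max(scaled.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(v - top) for v in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical scores for the word after "write the file using a(n) ..."
scores = {"atomic": 2.0, "plain": 1.5, "buffered": 1.0}

greedy = {sample_token(scores, 0, random.Random()) for _ in range(50)}
sampled = {sample_token(scores, 1.0, random.Random(seed)) for seed in range(50)}
```

Here `greedy` only ever contains one token, while `sampled` collects whatever the different random seeds happened to pick. Fixing the seed makes a draw repeatable, which is why the same program with the same prompt can still differ between runs when the random state differs.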

▲

b40d-48b2-979e 6 hours ago | parent | prev | next [-]

LLMs aren't people. They don't reason. They're token generators, a black box. Your analogy falls on its face with any scrutiny.

▲

CrazyStat 6 hours ago | parent | next [-]

I didn’t claim that LLMs are people or that they reason.

If the behavior of the LLM is the same as the behavior of reasonable people then the behavior of the LLM is reasonable, regardless of how black of a box they generate tokens out of.

Reasonable people will generate divergent specs for the same prompt. Thus it is reasonable for an LLM to generate divergent specs out of the same prompt.

Edit: I use “reasonable” here in the legal sense of the “reasonable person” standard, not to imply any reasoning process.

▲

b40d-48b2-979e 5 hours ago | parent [-]

[flagged]

	▲	CrazyStat 5 hours ago | parent [-]
		Please point to where in my initial comment I indicated that LLMs are human or reason. If you are unable to do so please withdraw your accusation of gaslighting, a serious form of psychological abuse, and apologize.

▲

jatora 6 hours ago | parent | prev | next [-]

it's an analogy, it didn't fall on its face at all. it's just a comparison to highlight the point being made was nonsensical. example: you're just a next action generator controlled by trillions of cells and subconscious dna-based behavior. a black box.

▲

svieira 6 hours ago | parent | next [-]

> you're just a next action generator controlled by trillions of cells and subconscious dna-based behavior.

With moral agency and the ability to learn (even if we presume you are correct, which I don't think you are).

	▲	jatora 4 hours ago | parent [-]
		moral agency and the ability to learn are implicit in the description you quoted. this isn't some special superpower, all animals have the ability to learn, and many have moral agency. these aren't human specific traits

▲

b40d-48b2-979e 6 hours ago | parent | prev [-]

Reductio ad absurdum.

	▲	jatora 4 hours ago | parent [-]
		exactly my point lol

▲

dnautics 6 hours ago | parent | prev | next [-]

LLMs do reason (they just sometimes don't reason well).

I assure you I've met many devs and "engineers" that reason less than LLMs, and are black boxes, especially in terms of the code they write.

	▲	claytongulick an hour ago | parent [-]
		> LLMs do reason

		No, they don't. They are token predictors that use statistical techniques to emit the randomly weighted next most likely token given the previous token list. The result is a strange mimic of human reasoning, because the tokens it predicts are trained on strings that were produced by humans that were reasoning, but that's not the same thing.

		Human cognition is complex and poorly understood, and the nature of the mind is an area of study almost as old as consciousness itself. We don't know exactly how it works, or what its exact relationship to the brain is, but we do know that it is not a simple token predictor.

		LLMs, by their very nature are constrained to the concept of language and the relationship between existing words in a corpus. This is a box they can not escape. Modern neuroscience suggests that the human brain is much more vast than that, and in many ways looks like it is constrained by language, but certainly not limited to it.

▲

Jtarii 4 hours ago | parent | prev [-]

They very obviously reason.

	▲	dnautics 2 hours ago | parent [-]
		it's kind of crazy to think that the transformer architecture can't encode some primitive form of reasoning.

▲

olafmol 6 hours ago | parent | prev | next [-]

An LLM should not "generate specs", a human should. The LLM can work from the specs. It can never infer meaning from a vague prompt. If so, it will start guessing. Every human that ever did functional specification or information analysis at some point knows this. Or has learned the hard way, something with assumptions and asses ;)

	▲	dist-epoch 5 hours ago | parent [-]
		The guessing of an LLM for a vague prompt is better than that of your average developer. A prompt like "write these two files on disk" will very likely make the LLM do some sort of an atomic write/swap operation, unlike the average developer, who will just write the two files and maybe later encounter a race condition bug. You can argue the LLM output is overkill, but it will also be more robust on average.
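For readers unfamiliar with the atomic write/swap pattern mentioned above, a minimal sketch: write to a temporary file in the same directory, flush it to disk, then rename it over the target. The helper name `atomic_write` is an assumption for illustration; `os.replace`, `os.fsync`, and `tempfile.mkstemp` are standard library calls.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # the swap: atomically replaces the destination
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that this makes each file update atomic on its own. It does not make a pair of files change together as one unit; that needs something extra, such as writing both into a new directory and swapping a pointer to it.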

▲

claytongulick an hour ago | parent | prev | next [-]

> If you ask 10 different humans to produce the spec with the same information (prompt and context) they will also produce 10 unique answers

But they didn't ask humans, they asked a machine. We expect our machines to behave in predictable ways.

> If the prompt already contained answers to all the decision points that come up when writing the spec then the prompt would already be the spec itself.

This is one of the best arguments against using LLMs I've seen.

It reduces to the classic argument: at the point where you've described a problem and solution in sufficient detail to be confident in the results, you've invented a programming language.

▲

skydhash 6 hours ago | parent | prev [-]

So what’s most important is knowing those parameters and the ranges of values, not having the final result. A human, after producing a spec, can then provide the mental model of how he created the spec: where the inflection points are and what the range of valid results is.

What has always mattered is how you decide the specs, not the specs in themselves.

▲

nullsanity 6 hours ago | parent | prev | next [-]

[dead]

▲

jatora 6 hours ago | parent | prev | next [-]

[flagged]

▲

Robdel12 6 hours ago | parent | prev [-]

Imagine making this your entire identity

▲

slopinthebag an hour ago | parent | prev | next [-]

It's incredible how much developers will do to avoid having to look at or think about code.

▲

AnimalMuppet 5 hours ago | parent | prev [-]

The return of pair programming.