| |
| ▲ | egeozcan 7 hours ago | parent | next [-] | | The trick is, with the setup I mentioned, you change the rewards. The concept is:

Red Team (Test Writers): write tests without seeing the implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious: it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced," and the barrier prevents them from writing tests pre-adapted to pass.

Green Team (Implementers): write implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as a noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.

Refactor Team: improve code quality without changing behavior. They can see the implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard). Reward is that all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give this team the skills needed to use them.

It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense. | | |
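The reward structure described above can be condensed into a toy loop. Everything here is invented for illustration (the spec, the scoring, the role functions); in a real harness the roles would be subagents behind the barrier, not plain Python functions. The key is what each role is allowed to see, enforced by the arguments it receives:

```python
SPEC = "clamp(x, lo, hi) returns x limited to the closed range [lo, hi]"

# Red writes tests from the spec alone; it never sees an implementation.
def red_team(spec):
    return {
        "inside_range": lambda f: f(5, 0, 10) == 5,
        "below_lo":     lambda f: f(-3, 0, 10) == 0,
        "above_hi":     lambda f: f(99, 0, 10) == 10,
    }

# Green sees the spec and the NAMES of failing tests, never the assertions.
def green_team(spec, failing_names):
    def clamp(x, lo, hi):
        return max(lo, min(x, hi))
    return clamp

def run_suite(tests, impl):
    return [name for name, check in tests.items() if not check(impl)]

tests = red_team(SPEC)

# Score Red against the current (stub) implementation: failures are good.
stub = lambda x, lo, hi: x
failing = run_suite(tests, stub)
red_reward = len(failing)                      # meaningful new failures

# Green only receives the failing names, then is scored on red -> green flips.
impl = green_team(SPEC, failing)
green_reward = red_reward - len(run_suite(tests, impl))
```

With the identity stub, `inside_range` already passes (worth nothing to Red), while the two boundary tests fail, which is exactly the "gap with a tripwire" the comment describes.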
| ▲ | seer 18 minutes ago | parent | next [-] | | This seems quite amazing really, thanks for sharing. What is the scope of projects/features you've seen this be successful at? Do you have a prior step where an agent verifies that your new feature spec is not contradictory, ambiguous, etc., perhaps reviewed against all the current feature sets? Do you make this a cycle per step, by breaking the feature down into small implementable and verifiable sub-features and coding them in sequence, or do you tell it to write all the tests first and then have at it with implementation and refactoring? Why not a refactor-red-green-refactor cycle? E.g. a lot of the time it is worth refactoring the existing code first, to make a new implementation easier; is it worth encoding this into the harness? | |
| ▲ | w4yai 5 hours ago | parent | prev | next [-] | | You guys are describing wonderful things, but I've yet to see any implementation. I tried coding my own agents, yet the results were disappointing. What kind of setup do you use? Can you share? How much does it cost? | | |
| ▲ | throwaway7783 3 hours ago | parent | next [-] | | We have a very uncomplicated setup with Claude Code: a CLAUDE.md with instructions and notes about the repo and how to run stuff. We also do code reviews with Claude Code, but in a separate session. It works wonderfully well. Costs about $200 USD per developer per month as of now. | |
| ▲ | dworks 5 hours ago | parent | prev | next [-] | | rlm-workflow does all that TDD for you: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow (I built it) | | |
| ▲ | cheema33 3 hours ago | parent | next [-] | | Why make powershell a requirement? I like powershell, but Python is very common and already installed on many dev systems. | |
| ▲ | _ink_ 5 hours ago | parent | prev [-] | | Thanks for sharing. What does RLM stand for? Any idea why the socket security test fails? | | |
| |
| ▲ | aprdm 3 hours ago | parent | prev | next [-] | | If you are not spending 5-10k dollars a month for interesting projects, you likely won't see interesting results | | |
| ▲ | cube00 2 hours ago | parent | next [-] | | Sounds a lot like paying for online ads: they don't work because you're not paying enough, when in reality bots, scrapers and now agents are just running up all the clicks. You pay more to try to get above that noise and hope you'll reach an actual human. The new "fast mode" that burns tokens at 6 times the rate is just scary, because that's what everyone will soon say we all need to be using to get results. | | |
| ▲ | zarzavat an hour ago | parent [-] | | It feels like everyone's gone mad. Here I am mostly writing code by hand, with some AI assistant help. I have a Claude subscription but only use it occasionally, because it can take more time to review and fix the generated code than it would to hand-write it. Claude only saves me time on the minority of tasks where it's faster to prompt than hand-write. And then I read about people spending hundreds or thousands of dollars a month on this stuff. Doesn't that turn your codebase into an unreadable mess? |
| |
| ▲ | mrbungie 3 hours ago | parent | prev [-] | | I can't really tell if this is sarcasm or not. |
| |
| ▲ | canadiantim 4 hours ago | parent | prev [-] | | Check out Matt Pocock's work; he's written excellent material about red-green-refactor and has a GitHub repo for his skills. Read and take what you need from his TDD skill and incorporate it into your own TDD skill tailored for your project. | | |
| ▲ | nojito 3 hours ago | parent [-] | | This is just AI slop. If you follow what the actual designers of Claude/GPT tell you, it flies in the face of building out over-engineered harnesses for agents. | | |
| ▲ | throwaway7783 3 hours ago | parent | next [-] | | I agree with this. There is not a lot of harnesses/wrapping needed for Claude Code. | | |
| ▲ | canadiantim 2 hours ago | parent [-] | | You don't need a harness beyond Claude Code, but honestly it's foolish to think you shouldn't be building out extra skills to help your workflow. A TDD skill that does red-green-refactoring is using Claude Code exactly as how it's meant to be used. They pioneered skills. |
| |
| ▲ | canadiantim 3 hours ago | parent | prev [-] | | Works better than standard claude / gpt, which doesn't do red-green-refactor. Doesn't seem like slop when it meaningfully changes the results for the better, consistently. Really is a game-changer. You should consider trying it. | | |
| ▲ | nojito 2 hours ago | parent [-] | | I do do TDD but using skills in this way is an anti-pattern for a multitude of reasons. | | |
| ▲ | canadiantim 2 hours ago | parent [-] | | I don't think saying it's an anti-pattern for a multitude of reasons, and then not naming any of them, is going to convince anyone it's an anti-pattern. This is in fact precisely what skills are meant for, and is the opposite of an anti-pattern - more like best practice now. It's explicitly using the skills framework exactly how it was meant to be used. |
|
|
|
|
| |
| ▲ | tomtom1337 7 hours ago | parent | prev | next [-] | | This is very interesting, but like sibling comments, I'm very curious as to how you run this in practice. Do you just tell Claude/Copilot to do what you describe? And do you have any prompts to share? | | |
| ▲ | throwaway7783 3 hours ago | parent | next [-] | | You don't need most of this. Prompts are normally just what you would say to another engineer:

* There is a lot of duplication between A & B. Refactor this.
* Look at ticket X and give me a root cause.
* Add support for three new types of credentials - Basic Auth, Bearer Token and OAuth Client Creds.

CLAUDE.md has stuff like "Here's how you run the frontend. Here's how you run the backend. This module supports the frontend. That module is batch jobs. Always start commit messages with the ticket number. Always run compile at the top level. When you make code changes, always add tests," etc. | |
| ▲ | Exoristos an hour ago | parent | prev [-] | | They never do. |
| |
| ▲ | esperent 24 minutes ago | parent | prev | next [-] | | Someone directly down from you suggested looking up Matt Pocock's TDD skill, so I did: https://github.com/mattpocock/skills/blob/main/tdd%2FSKILL.m... Everything below is quoted from that skill, and serves as a much better rebuttal than the one I had started writing:

DO NOT write all tests first, then all implementation. This is "horizontal slicing" - treating RED as "write all tests" and GREEN as "write all code." This produces crap tests:

* Tests written in bulk test imagined behavior, not actual behavior
* You end up testing the shape of things (data structures, function signatures) rather than user-facing behavior
* Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine
* You outrun your headlights, committing to test structure before understanding the implementation

Correct approach: Vertical slices via tracer bullets. One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it. | |
| ▲ | tnecio 2 hours ago | parent | prev | next [-] | | How do you make sure Red Team doesn't just write subtly broken tests? | |
| ▲ | xienze 5 hours ago | parent | prev | next [-] | | This seems like a tremendous amount of planning, babysitting, verification, and token cost just to avoid writing code and tests yourself. | | |
| ▲ | habinero 5 hours ago | parent | next [-] | | It's assigning yourself the literal worst parts of the job - writing specs, docs, tests and reading someone else's code. | | |
| ▲ | zarzavat 34 minutes ago | parent [-] | | There's a real disconnect. I was talking to a junior developer and they were telling me how Claude is so much smarter than them and they feel inferior. I couldn't relate. From my perspective as a senior, Claude is dumb as bricks. Though useful nonetheless. I believe that if you're substantially below Claude's level then you just trust whatever it says. The only variables you control are how much money you spend, how much markdown you can produce, and how you arrange your agents. But I don't understand how the juniors on HN have so much money to throw at this technology. |
| |
| ▲ | gedy 4 hours ago | parent | prev [-] | | Yes with the reward of: I don't understand this code and didn't learn anything incrementally about the feature I "planned". |
| |
| ▲ | skybrian 7 hours ago | parent | prev [-] | | How do you define visibility rules? Is that possible for subagents? | | |
| ▲ | egeozcan 7 hours ago | parent [-] | | AFAIK Claude doesn't support it, but if you're willing to go the extra mile, you can get creative with some bash script: https://pastebin.com/raw/m9YQ8MyS (generated this a second ago, just to get the point across). To be clear, I don't do this. I never saw an agent cheat by peeking or anything - I really did look through their logs. I'd be very interested to see Claude Code and other tools support this pattern when dispatching agents, to be really sure. | | |
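In the same spirit as the bash approach, here is a sandbox-copy sketch in Python (hypothetical, not the pastebin script): each role gets a temp copy of the repo with the other role's files stripped out before a subagent is dispatched into it, so there is nothing to peek at.

```python
import shutil
import tempfile
from pathlib import Path

def make_sandbox(repo: Path, hide: str) -> Path:
    """Copy `repo` into a fresh temp dir, omitting the `hide` subtree.
    The subagent is then dispatched into the copy, not the real repo."""
    dest = Path(tempfile.mkdtemp(prefix=f"sandbox-no-{hide}-"))
    shutil.copytree(repo, dest / repo.name,
                    ignore=shutil.ignore_patterns(hide))
    return dest / repo.name

# Green would work in make_sandbox(repo, "tests");
# Red in make_sandbox(repo, "src"). Results flow back via the coordinator.
```

Copying is cruder than per-file permissions but survives agents that shell out (`cat`, `grep`) rather than using tool-level file APIs, since the hidden files simply aren't there.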
| ▲ | achierius 6 hours ago | parent | next [-] | | > To be clear, I don't do this. How do you know that it works then? Are you using a different tool that does support it? | |
| ▲ | skybrian 6 hours ago | parent | prev [-] | | So what do you do? Do you define roles somewhere and tell the agent to assign these roles to subagents? | | |
| ▲ | ssk42 5 hours ago | parent [-] | | Fun to see you not on tildes. Setting up a clean room is one of the only ways to do evals on agentic harnesses. Especially relevant with Windsurf, which doesn't have an easy CLI start. So how? The easiest answer, when allowed, is Docker: literally a new image per prompt. There are also flags with Claude to not use memory, and from there you can use -p to have it just be like a normal CLI tool. Windsurf requires the manual effort of starting it up in a new dir. | | |
| ▲ | skybrian 3 hours ago | parent [-] | | Sounds interesting, but I'm not quite getting the relevance for people writing code with an agent. Should I be doing evals? | | |
| ▲ | ssk42 38 minutes ago | parent [-] | | Well, I mean, yes. I think people ought to be aware of how the harnesses compare for their stacks. But the clean room applies to this RGR situation too. |
|
|
|
|
|
| |
| ▲ | lagrange77 7 hours ago | parent | prev | next [-] | | > Reward hacking is very real and hard to guard against. Is it really about rewards? I'm genuinely curious, because it's not an RL model. | | |
| ▲ | gbnwl 7 hours ago | parent | next [-] | | I'm noticing terms related to DL/RL/NLP are being used more and more informally as AI takes over more of the cultural zeitgeist and people want to use the fancy new terms of the era, even if inaccurately. A friend told me he "trained and fine tuned a custom agent" for his work when what he meant was he modified a claude.md file. | | |
| ▲ | collingreen 25 minutes ago | parent [-] | | Respectfully, your friend doesn't know what he is talking about and is saying things that just "feel right" (vibe talking??). Which might be exactly how technical terms lose their meaning so perhaps you're exactly right. |
| |
| ▲ | hexaga 6 hours ago | parent | prev | next [-] | | There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model. And with that comes reward hacking - which isn't really about looking for more reward but rather that the model has learned patterns of behavior that got reward in the train env. That is, any kind of vulnerability in the train env manifests as something you'd recognize as reward hacking in the real world: making tests pass _no matter what_ (because the train env rewarded that behavior), being wildly sycophantic (because the human evaluators rewarded that behavior), etc. | | |
| ▲ | lagrange77 3 hours ago | parent [-] | | > There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model. Hm, as I understand it, parts of the training of e.g. ChatGPT could be called RL. But the subject being trained/fine-tuned is still a seq2seq next-token-predictor transformer neural net. | | |
| ▲ | hexaga 3 hours ago | parent [-] | | RL is simply a broad category of training methods. It's not really an architecture per se: modern GPTs are trained first on reconstruction objective on massive text corpora (the 'large language' part), then on various RL objectives +/- more post-training depending on which lab. |
|
| |
| ▲ | magicalist 7 hours ago | parent | prev | next [-] | | > Is it really about rewards? Im genuinely curious. Because its not a RL model. Ha, good point. I was using it informally (you could handwave and call it an intrinsic reward if a model is well aligned to completing tasks as requested), but I hadn't really thought about it. Searching around, it seems like I'm not alone, but it looks like "specification gaming" is also sometimes used, like: https://deepmind.google/blog/specification-gaming-the-flip-s... | |
| ▲ | nurettin 7 hours ago | parent | prev [-] | | They probably meant goal hacking. (I just made that up) |
| |
| ▲ | SoftTalker 7 hours ago | parent | prev [-] | | A refactor should not affect the tests at all should it? If it does, it's more than a refactor. | | |
| ▲ | gchamonlive 7 hours ago | parent | next [-] | | It can, if your refactor involves interface changes: moving methods around, changing argument order, etc. All of these need to propagate to the tests. | | |
| ▲ | bluGill 6 hours ago | parent [-] | | Your tests are an assertion that "no matter what, this will never change." If your interface can change, then you are testing implementation details instead of the behavior users care about. The above is really hard. A lot of TDD "experts" don't understand this and teach fragile tests that are not worth having. | | |
| ▲ | 8note 3 hours ago | parent | next [-] | | https://www.hyrumslaw.com/ Your implementation is your interface. It's a bit naive, or hating-your-users, to assume your tests are what your users care about. They're dealing with everything, regardless of what you've tested or not. | | |
| ▲ | SirSavary 2 hours ago | parent [-] | | Hyrum's law is about the real consumers/users (inadvertently) depending on any observable behaviour they can get their hands on. TDD/BDD tests are meant to define the intended contract of a system. These are not the same thing. |
| |
| ▲ | switchbak 4 hours ago | parent | prev [-] | | Refactoring is changing the design of the code without affecting the behaviour. You can change an interface and not change the behaviour. I have rarely heard as rigid an interpretation as this. |
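The behavior-vs-implementation distinction the subthread is circling can be shown with a toy example (all names invented). The first test pins observable behavior and survives any behavior-preserving refactor; the second spies on an internal helper and breaks the moment the helper is inlined, even though users see no difference:

```python
def subtotal(price, qty):                  # internal helper (a v1 detail)
    return price * qty

def total_price(items):
    return sum(subtotal(p, q) for p, q in items)

# Behavior test: inputs and outputs only. Still passes if total_price
# is rewritten as a plain loop or the helper is inlined.
def test_behavior():
    assert total_price([(2.0, 3), (1.5, 2)]) == 9.0

# Implementation-coupled test: asserts HOW, not WHAT, by spying on the
# helper through the module globals. Inlining the helper breaks it.
def test_implementation_detail():
    calls = []
    real = subtotal
    globals()["subtotal"] = lambda p, q: (calls.append((p, q)), real(p, q))[1]
    try:
        total_price([(2.0, 3)])
        assert calls == [(2.0, 3)]         # fragile: pins a private detail
    finally:
        globals()["subtotal"] = real
```

In bluGill's framing, only the first test is an assertion worth keeping: the second turns a legitimate refactor into a red build.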
|
| |
| ▲ | magicalist 7 hours ago | parent | prev [-] | | It depends on what you mean by "refactor" and how exactly you're testing, I guess, but that's not really at the heart of the point. red-green-refactor could also be used for adding new features, for instance, or an entire codebase, I guess. |
|
|