brendanmc6 3 hours ago

Sure, happy to share. It’s been trial and error, but I’ve learned that for agents to reliably ship a large feature or refactor, I need a good spec (functional acceptance criteria) and I need a good plan for sequencing the work.

The big expensive models are great at planning tasks and reviewing the implementation of a task. They can better spot potential gotchas, performance or security gaps, subtle logic and nuance that cheaper models fail to notice.

The small cheap models are actually great (and fast) at generating decent code if they have the right direction up front.

So I do all the spec writing myself (with some LLM assistance), and I hand it to a Supervisor agent who coordinates between subagents. Plan -> implement -> review -> repeat until the planner says “all done”.
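The loop above can be sketched in a few lines. This is my own hypothetical rendering, not the author's actual setup: the three model calls are stubbed out with plain functions, and in practice each would hit a different LLM API (an expensive planner/reviewer, a cheap implementer).

```python
# Minimal sketch of the plan -> implement -> review loop described above.
# All three "model" functions are stubs standing in for real LLM calls.

def plan_task(spec):
    # Expensive model: break the spec into an ordered plan.
    return [f"step: {line}" for line in spec.splitlines() if line.strip()]

def implement(step):
    # Cheap/fast model: generate code for one well-scoped step.
    return f"code for {step}"

def review(spec, artifacts):
    # Expensive model: check the work against the acceptance criteria.
    # Stubbed here to approve once every step has produced an artifact.
    return len(artifacts) == spec.count("\n") + 1

def supervise(spec, max_loops=5):
    # Supervisor: repeat plan -> implement -> review until approved.
    for _ in range(max_loops):
        plan = plan_task(spec)
        artifacts = [implement(step) for step in plan]
        if review(spec, artifacts):
            return artifacts  # the planner says "all done"
    raise RuntimeError("spec not satisfied within the loop budget")
```

The interesting design choice is that the supervisor never writes code itself; it only routes a spec between a planner, implementers, and a reviewer.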

I switch up my models all the time (actively experimenting) but today I was using GPT 5.4 for review and planning, costing me about $0.4-$1 for a good sized task, and Kimi for implementation. Sometimes my spec takes 4-5 review loops and the cost can add up over an 8 hour day. Still cheaper than Claude Max (for now, barely).

Each agent retains a fairly small context window which seems to keep costs down and improves output. Full context can be catastrophic for some models.
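One way to picture the "small context per agent" idea: each subagent sees only its task plus a trimmed slice of recent history, never the full conversation. This is a rough sketch of such a trimmer (my assumption, not the author's code), using word count as a crude stand-in for token count.

```python
# Keep only the most recent messages that fit a per-agent budget,
# so each subagent runs with a small context window.

def trim_context(messages, budget=2000):
    """Return the newest messages whose combined size fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = len(msg.split())  # crude proxy for a real token count
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

A real implementation would use the model's own tokenizer, but the shape is the same: drop the oldest context first, keep the budget fixed.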

As for the spec writing: this is the fun part for me. I’ve been obsessing over the process, and over tracking acceptance criteria and keeping my agents aligned to it. I have a toolkit cooking; you can find it in my comment history (aiming to open source it this week).
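For the acceptance-criteria tracking, a checklist the reviewer updates each loop is one plausible shape. This is a guess at the structure, not the toolkit mentioned above:

```python
# Hypothetical structure for tracking acceptance criteria across
# review loops: the reviewer marks criteria pass/fail, and the
# supervisor stops once everything passes.

from dataclasses import dataclass, field

@dataclass
class Criterion:
    description: str
    passed: bool = False

@dataclass
class Spec:
    criteria: list = field(default_factory=list)

    def mark(self, description, passed):
        # Reviewer records the verdict for one criterion.
        for c in self.criteria:
            if c.description == description:
                c.passed = passed

    def all_done(self):
        # The loop exits only when every criterion has passed.
        return all(c.passed for c in self.criteria)
```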