visarga | 4 hours ago
I did this too, ablating all the components in my coding agent harness. The main insight from my meta-optimization loops was "have judge agents review the plan and the implementation". Another insight: collect not just execution traces but also all the human-in-the-loop nudges and steering commands; seen in context, they are one of the purest sources of feedback on a coding agent. I agree with OP on the need to collect traces and compare them, not just scores; traces are a much richer source of feedback. A minimal sketch of what that logging could look like is below. If anyone is interested, I have a slide deck about my approach: https://horiacristescu.github.io/claude-playbook-plugin/docs...
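
To make the trace idea concrete, here is a minimal sketch of logging human nudges alongside agent steps in one trace, so runs can be compared event by event rather than by score alone. This is an illustration, not the actual harness; the event kinds and field names are hypothetical, assuming a simple JSONL trace per run.

    # Sketch: one JSONL trace per run holding agent steps, human nudges,
    # and judge reviews, so the meta-optimization loop sees nudges in context.
    # All event kinds and field names here are hypothetical.
    import json
    import time
    from dataclasses import dataclass, asdict
    from pathlib import Path

    @dataclass
    class TraceEvent:
        ts: float      # wall-clock timestamp
        kind: str      # "agent_step", "human_nudge", or "judge_review"
        content: str   # tool call, steering command, or review verdict
        run_id: str    # groups events belonging to one agent run

    def log_event(path: Path, event: TraceEvent) -> None:
        """Append one event to the run's JSONL trace file."""
        with path.open("a") as f:
            f.write(json.dumps(asdict(event)) + "\n")

    def load_trace(path: Path) -> list[TraceEvent]:
        """Read a trace back for side-by-side comparison across runs."""
        return [TraceEvent(**json.loads(line)) for line in path.open()]

    # Usage: a human steering command is recorded next to the agent step it
    # interrupted, rather than being lost once the run finishes.
    trace = Path("run_042.jsonl")
    log_event(trace, TraceEvent(time.time(), "agent_step",
                                "edit src/main.py", "run_042"))
    log_event(trace, TraceEvent(time.time(), "human_nudge",
                                "stop, write the tests first", "run_042"))

The point of keeping everything in one ordered log is that a nudge only carries signal next to the step that provoked it; stripped out and aggregated into a score, that context is gone.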