nielstron 4 hours ago
Hey, thanks for your review, a paper author here. Regarding the 4% improvement for human-written AGENTS.md: this would indeed be huge if it were a _consistent_ improvement. However, on Sonnet 4.5, for example, performance _drops_ by over 2%. Qwen3 benefits most, and GPT-5.2 improves by 1-2%. The LLM-generated prompts follow the coding agents' recommendations. We also show an ablation over different prompt types, and none has consistently better performance. But ultimately I agree with your post. In fact, we do recommend writing a good AGENTS.md, manually and in a targeted way. This is emphasized, for example, at the end of our abstract and conclusion.
sdenton4 19 minutes ago | parent | next
You're measuring binary outcomes, so you can use a beta distribution to model the distribution of possible success rates given your observations, and thereby put an interval around the observed rates. This would help us see whether that 4% improvement is statistically significant, or whether it is likely to be noise.
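A minimal sketch of what this could look like, assuming a Beta(1, 1) prior and hypothetical pass/fail counts (not taken from the paper), using scipy:

```python
# Treat each benchmark run as a Bernoulli trial, put a Beta posterior on the
# underlying success rate, and report an equal-tailed interval.
from scipy.stats import beta

def success_rate_interval(successes, failures, level=0.95):
    """Equal-tailed credible interval for a success rate under a Beta(1, 1) prior."""
    posterior = beta(successes + 1, failures + 1)
    return posterior.interval(level)

# Hypothetical example: baseline solves 50/100 tasks, AGENTS.md run solves 54/100.
print(success_rate_interval(50, 50))  # roughly (0.40, 0.60)
print(success_rate_interval(54, 46))  # roughly (0.44, 0.63)
# The heavy overlap suggests a 4-point gap on 100 tasks could easily be noise.
```

With only a hundred-odd tasks per model, the intervals are wide enough that a few points of difference in either direction is hard to distinguish from chance.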
vidarh 3 hours ago | parent | prev
Without measuring quality of output, this seems irrelevant to me. My use of CLAUDE.md is to get Claude to avoid making stupid mistakes that would require subsequent refactoring or cleanup passes. Performance is not a consideration. If anything, beyond CLAUDE.md I add agent harnesses that often increase the time and tokens used many times over, because my time is more expensive than the agents'.
| ||||||||||||||