max-t-dev 4 hours ago

Yeah, fair point. The benchmark is single-run per arm-prompt pair, so the variance finding on safety categories could be noise rather than signal. The findings doc flags this for the score deltas (anything under 0.02 between arms is in the judge's noise floor), but I didn't apply the same caveat to the per-question token variance, and I should have. Will read the lambda variance write-up. Multi-trial with cost classification is the right direction. The single-shot harness was deliberately scoped for a clean compression-only comparison before adding turns or trials, but you're right that without trials the variance findings aren't as solid. Thanks for the reply.
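For anyone following along, here's a rough sketch of what I mean by the multi-trial direction: run each arm-prompt pair several times, then compare arms against both the judge's noise floor and the observed per-trial spread. All the names and the runner are hypothetical, not from the actual harness.

```python
import statistics

# Score deltas below this are inside the judge's noise floor
# (the 0.02 threshold mentioned in the findings doc).
NOISE_FLOOR = 0.02

def score_with_trials(arm, prompt, runner, n_trials=5):
    """Run one arm-prompt pair n_trials times; report mean and spread.

    `runner(arm, prompt)` is a placeholder for a single benchmark run
    that returns a judge score in [0, 1].
    """
    scores = [runner(arm, prompt) for _ in range(n_trials)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n_trials > 1 else 0.0,
    }

def significant_delta(a, b):
    """Treat a between-arm delta as signal only if it clears both the
    judge noise floor and the combined per-trial spread."""
    return abs(a["mean"] - b["mean"]) > max(NOISE_FLOOR, a["stdev"] + b["stdev"])
```

With a single run per pair (n_trials=1) the stdev term is unknowable, which is exactly why the single-shot variance findings deserve the same caveat as the score deltas.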

dataviz1000 2 hours ago | parent [-]

I'm trying to wrap my mind around this. Anything you explore and share is awesome. Thanks for the blog post.

If you want to test it across coding tasks, have a look at https://github.com/adam-s/testing-claude-agent