dannyw 2 days ago

Wouldn't you need to re-run across lots of samples (even for a single eval/bench) to avoid outsized impacts from just bad luck?
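To put a rough number on the "bad luck" concern, here is a minimal sketch. It assumes each sample is an independent pass/fail trial, so the score behaves like a binomial proportion, and it uses the normal approximation for the interval. The function names are hypothetical and not taken from any particular eval harness.

```python
import math


def pass_rate_standard_error(pass_rate: float, n_samples: int) -> float:
    """Standard error of a pass rate estimated from n independent pass/fail samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_samples)


def approx_95_interval(pass_rate: float, n_samples: int) -> tuple[float, float]:
    """Approximate 95% interval (normal approximation), clipped to [0, 1]."""
    half_width = 1.96 * pass_rate_standard_error(pass_rate, n_samples)
    return (max(0.0, pass_rate - half_width), min(1.0, pass_rate + half_width))


if __name__ == "__main__":
    # Assumed example: a model whose true pass rate is 50%.
    for n in (100, 1_000, 10_000):
        low, high = approx_95_interval(0.5, n)
        print(f"n={n}: roughly {low:.3f} to {high:.3f}")
```

Under these assumptions, a 100-sample eval at a 50% pass rate has a standard error of about 5 points, so two runs that differ by a few points are hard to tell apart. The error shrinks with the square root of the sample count, which is why many samples or repeated runs are needed before small differences mean anything.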