sfink 9 hours ago

> We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%.

I don't understand what you're comparing. Can't you increase bayesect accuracy arbitrarily by running it longer? When are you choosing to terminate? Perhaps I don't understand this after all.

rs545837 8 hours ago | parent [-]

Yes, bayesect accuracy increases with more iterations. The comparison was at a fixed budget (300 test runs). Sorry, I should have made that clear.

hauntsaninja 3 hours ago | parent [-]

Yep, you can run bayesect to an arbitrary confidence level.

This script in the repo https://github.com/hauntsaninja/git_bayesect/blob/main/scrip... will show you a) that the confidence level is calibrated, and b) how quickly you get to that confidence level (on average, p50 and p95).

For the failure rates you describe, calibration.py shows that you should see much higher accuracy at 300 tests.
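
To make "run to an arbitrary confidence level" concrete, here is a minimal sketch of the idea. It is not the code from the repo. It assumes the failure rates before and after the culprit (`p_old`, `p_new`, both hypothetical parameters) are known in advance, and it picks the next commit to test with a simple "split the posterior mass in half" heuristic that I swapped in for illustration. All function names are made up for this example.

```python
import random

def update(posterior, i, failed, p_old, p_new):
    """Bayes update after testing commit i; posterior[c] = P(culprit is c)."""
    updated = []
    for c, p in enumerate(posterior):
        # Under hypothesis c, commit i contains the change iff i >= c.
        p_fail = p_new if i >= c else p_old
        updated.append(p * (p_fail if failed else 1.0 - p_fail))
    total = sum(updated)
    return [x / total for x in updated]

def pick_commit(posterior):
    """Pick the commit whose cumulative posterior mass is closest to 0.5."""
    best_i, best_gap, cum = 0, float("inf"), 0.0
    for i, p in enumerate(posterior):
        cum += p
        gap = abs(cum - 0.5)
        if gap < best_gap:
            best_i, best_gap = i, gap
    return best_i

def bayes_bisect(n, culprit, p_old, p_new, confidence, max_tests, rng):
    """Simulate flaky tests until one commit reaches `confidence` or the budget runs out."""
    posterior = [1.0 / n] * n  # uniform prior over commits
    best = 0
    for tests in range(1, max_tests + 1):
        i = pick_commit(posterior)
        # Simulated flaky test: fails with p_new at or after the culprit, p_old before it.
        failed = rng.random() < (p_new if i >= culprit else p_old)
        posterior = update(posterior, i, failed, p_old, p_new)
        best = max(range(n), key=lambda c: posterior[c])
        if posterior[best] >= confidence:
            return best, tests
    return best, max_tests
```

Calling, say, `bayes_bisect(128, 40, 0.1, 0.9, 0.99, 300, random.Random(0))` returns the most likely commit and the number of tests used. Setting `max_tests` low corresponds to the fixed-budget comparison; raising `confidence` trades more test runs for higher accuracy.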

rs545837 2 hours ago | parent [-]

You're right, at 300 tests bayesect converges to ~97-100% across the board. I reran with calibration.py and confirmed.

Went a step further and tested graph-weighted priors (per-commit weight proportional to transitive dependents, Pareto-distributed). The prior helps in the budget-constrained regime:

128 commits, 500 trials:

Budget=50, 70/30: uniform 22% → graph 33%

Budget=50, 80/20: uniform 71% → graph 77%

Budget=100, 70/30: uniform 56% → graph 65%

At 300 tests the gap disappears since there's enough data to converge anyway. The prior is worth a few bits, which matters when bits are scarce.

Script: https://gist.github.com/rs545837/b3266ecf22e12726f0d55c56466...
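
One rough way to put a number on "worth a few bits" is to compare the entropy of the two priors. The sketch below is separate from the linked gist. The Pareto draws are a synthetic stand-in for per-commit transitive-dependent counts, and `alpha` is an assumed shape parameter, not a value taken from the experiment.

```python
import math
import random

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def entropy_bits(prior):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in prior if p > 0)

def pareto_prior(n, alpha, rng):
    """Prior proportional to synthetic Pareto-distributed weights (assumption)."""
    return normalize([rng.paretovariate(alpha) for _ in range(n)])

def prior_gain_bits(n, alpha, rng):
    """Entropy of the uniform prior minus entropy of the weighted prior."""
    uniform = normalize([1.0] * n)
    return entropy_bits(uniform) - entropy_bits(pareto_prior(n, alpha, rng))
```

For 128 commits the uniform prior has exactly 7 bits of entropy, so `prior_gain_bits(128, alpha, rng)` reports how far the weighted prior falls below that. The gap only turns into an accuracy gain if the culprit really does tend to be a high-weight commit, which is what the simulation assumes.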