menaerus 7 months ago
I agree, and I am totally happy to say "I tried to measure, but the result I found is inconclusive" or "I believe this is at worst a neutral commit - i.e. it won't add any regression". Having spent probably thousands of hours e2e benchmarking the code I wrote, I'm always skeptical about benchmarking frameworks, blogs, etc. - the latest being the paper from Meta, where they claim they can detect 0.005% regressions. I really don't think this is possible in sufficiently complex e2e system tests. IME it is extremely challenging to detect, with high confidence, regressions below 5%.
Sesse__ 7 months ago | parent
It really depends on your benchmark and how much bias you're willing to trade for your variance. I mean, SQLite famously uses Callgrind and claims to be able to measure 0.005% changes, which they definitely can -- but only on the CPU that Callgrind simulates, which may or may not coincide with reality. Likewise, I've used strategies similar to the one Meta describes, where I run benchmarks before-and-after but only look at the single relevant function in the profile. That removes a whole lot of noise (I've reliably found -- and later verified by other means -- 0.2% wins in large systems), but it won't catch cases like large-scale code bloat.

The biggest hurdle as I see it is really that we don't have something like STABILIZER; if you're measuring a full binary, it's very likely that issues like code moving around cause you to measure something completely different from what you intended, and we have pretty much no way of countering that currently. And the more locked you are to a single code path (i.e., your histogram is very skewed), the worse these issues get.
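To make the "only look at the single relevant function" idea concrete (this is not Meta's or SQLite's actual tooling, just a rough sketch): assuming two Callgrind runs of the same workload and callgrind_annotate's usual "Ir  file:function" output, something like the script below compares one function's instruction count between a before and an after build. The script name, argument order, and regex are made up for the example and may need tweaking for your Valgrind version (e.g. the optional percentage column, or C++ names containing spaces).

```python
#!/usr/bin/env python3
"""Sketch: compare one function's Ir count between two callgrind runs.

Usage (hypothetical):
    python3 fn_diff.py callgrind.out.before callgrind.out.after sqlite3VdbeExec
"""
import re
import subprocess
import sys


def function_ir(callgrind_out: str, func: str) -> int:
    """Return the Ir count callgrind_annotate reports for `func`, or 0 if absent."""
    text = subprocess.run(
        ["callgrind_annotate", callgrind_out],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in text.splitlines():
        # Typical lines look like:
        #   "  1,234,567  src/foo.c:my_function [./binary]"
        # or, in newer versions, with a percentage:
        #   "  1,234,567 ( 3.41%)  src/foo.c:my_function [./binary]"
        # Note: \S+ will truncate C++ names that contain spaces.
        m = re.match(r"\s*([\d,]+)\s*(?:\([\d. %]*\))?\s+\S*?:(\S+)", line)
        if m and m.group(2) == func:
            return int(m.group(1).replace(",", ""))
    return 0


if __name__ == "__main__":
    before, after, func = sys.argv[1:4]
    a, b = function_ir(before, func), function_ir(after, func)
    if a == 0:
        sys.exit(f"{func} not found in {before}")
    print(f"{func}: {a} -> {b} Ir ({100.0 * (b - a) / a:+.3f}%)")
```

Because the comparison is restricted to one function's simulated instruction count, most of the run-to-run noise of the full binary drops out -- at the cost of missing effects outside that function, exactly as described above.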