Sesse__ | 7 months ago
It really depends on your benchmark and how much bias you're willing to accept in exchange for lower variance. I mean, SQLite famously uses Callgrind and claims to be able to measure 0.005%; which they certainly can, but only on the CPU that Callgrind simulates, which may or may not coincide with reality.

Likewise, I've used strategies similar to the one Meta describes, where I run benchmarks before-and-after but only look at the single relevant function in the profile. That removes a whole lot of noise (I've reliably found -- and later verified by other means -- 0.2% wins in large systems), but it won't catch cases like large-scale code bloat.

The biggest hurdle as I see it is really that we don't have something like STABILIZER; if you're measuring a full binary, it's very likely that issues like code moving around cause you to measure completely different things from what you intended, and we currently have pretty much no way of countering that. And the more locked you are to a single code path (i.e., the more skewed your histogram), the worse these issues are.
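As a concrete illustration of the single-function approach, here is a minimal sketch that compares one symbol's attributed cycles between two perf(1) recordings. The file names, the symbol hot_function, and the assumption of a flat (no call-graph) profile are all illustrative, not anything from the comment above:

    #!/usr/bin/env python3
    # Sketch: compare one function's attributed cycles between two perf
    # profiles, e.g. recorded with `perf record -o before.data ./bench`.
    # Assumes a flat profile (no --call-graph), where each symbol line in
    # `perf report --stdio` ends with the symbol name.
    import re
    import subprocess

    def attributed_cycles(perf_data: str, symbol: str) -> float:
        """Self-overhead of `symbol`, scaled by the run's total event count."""
        out = subprocess.run(
            ["perf", "report", "--stdio", "--percent-limit", "0",
             "-i", perf_data],
            capture_output=True, text=True, check=True).stdout
        total = int(re.search(r"# Event count \(approx\.\): (\d+)",
                              out).group(1))
        pct = sum(float(m.group(1)) for m in re.finditer(
            r"^\s*([\d.]+)%.*\[\.\]\s+" + re.escape(symbol) + r"\s*$",
            out, re.MULTILINE))
        return total * pct / 100.0

    before = attributed_cycles("before.data", "hot_function")
    after = attributed_cycles("after.data", "hot_function")
    print(f"hot_function: {before:.3e} -> {after:.3e} events "
          f"({(after - before) / before:+.2%})")

Scaling by each run's event count matters: raw perf report percentages are relative to that run's total, so a function can shrink in percent terms simply because the rest of the binary got slower.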
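On the STABILIZER point: one can approximate its code-layout randomization by re-linking the same program with a shuffled function order and benchmarking each layout, so that layout becomes noise you can average over instead of a fixed hidden variable. A rough sketch, assuming clang with lld - note the --shuffle-sections syntax differs between lld versions (older ones take just a seed, newer ones a section glob plus a seed) - and an illustrative bench.c:

    #!/usr/bin/env python3
    # Sketch: measure across randomized code layouts (poor man's STABILIZER).
    # -ffunction-sections puts each function in its own section so that the
    # linker's --shuffle-sections can reorder them; the flag syntax varies
    # by lld version. bench.c and the timing protocol are illustrative.
    import statistics
    import subprocess
    import time

    def build(seed: int) -> None:
        subprocess.run(
            ["clang", "-O2", "-ffunction-sections", "-fuse-ld=lld",
             f"-Wl,--shuffle-sections=.text*={seed}", "bench.c", "-o", "bench"],
            check=True)

    def run_once() -> float:
        start = time.perf_counter()
        subprocess.run(["./bench"], check=True)
        return time.perf_counter() - start

    samples = []
    for seed in range(1, 21):                      # 20 random layouts
        build(seed)
        samples.append(min(run_once() for _ in range(5)))  # best-of-5 each

    print(f"mean {statistics.mean(samples):.4f}s, "
          f"stdev {statistics.stdev(samples):.4f}s "
          f"over {len(samples)} layouts")

This doesn't randomize stack or heap placement the way STABILIZER did, but it at least stops a single arbitrary code layout from being mistaken for the effect of your patch.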
7 months ago | parent | next
[deleted] | ||||||||||||||||||||||||||
menaerus | 7 months ago | parent | prev
EDIT: sorry for the wall of text, but I really find this topic quite interesting to discuss.

> where I run benchmarks before-and-after but only look at the single relevant function in the profile. That removes a whole lot of noise

Yes, but I have always found that approach insufficient IME. For example, let's say the function profiling data shows that f(x) improved by X% after my change, but when I run the E2E system tests I get one of the following:

1. E2E system tests over M different workloads show no difference in performance. The correlation between the change and E2E performance is zero in all M workloads.

2. E2E system tests over M different workloads show that performance improved. The correlation between the change and E2E performance is therefore positive.

3. E2E system tests over M different workloads show that performance degraded. The correlation between the change and E2E performance is negative.

IME the distribution of probabilities over (#1, #2, #3) is ~[.98, .01, .01].

Hypothesis #1: none of the M workloads was sufficient to show a positive or negative correlation between the change and E2E performance. In other words, we haven't yet found that particular (M+1)st workload which would show that there really is a change in performance.

Hypothesis #2: there is simply no correlation between the change and E2E performance, exactly as the experiment results have shown.

Hypothesis #3: our benchmark measurement is insufficient to catch the change. Resolution might be lacking. Precision might be lacking. Accuracy too.

I find hypothesis #2 the most probable whenever the experiment results are repeatable (precision). This also means that the majority of the changes we developers make for the sake of "optimization gains" can easily be disproved: you can land 10s or 100s of "small optimizations" and still see no measurable impact on E2E runtime performance.

> The biggest hurdle as I see it is really that we don't have something like STABILIZER; if you're measuring a full binary, it's very likely that issues like code moving around cause you to measure completely different things from what you intended, and we currently have pretty much no way of countering that.

I agree, and I see this as the problem of pinning down all the random variables in our system. Otherwise we don't have the same initial conditions for each experiment run - and in reality, we really don't. And pretty much everything is a random variable. Compiler. Linker. Two consecutive builds of the same source do not necessarily produce the same binary, e.g. code layout may change. The kernel has state. The filesystem has state. Our NVMe drives have state. Then there is the page cache. The I/O scheduler. The task scheduler. NUMA. CPU throttling. So there is a bunch of multidimensional random variables spread across time, all of which impact the experiment results - a stochastic process by definition.
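One way to tell hypothesis #3 apart from a true null (hypothesis #2) is to quantify what the E2E benchmark can actually resolve, e.g. with a permutation test over repeated before/after runs: if the observed difference is common under random relabeling of the runs, the benchmark cannot distinguish the change from noise. A minimal sketch with made-up measurements:

    #!/usr/bin/env python3
    # Sketch: permutation test on repeated E2E runs. A large p-value means
    # the observed difference is indistinguishable from run-to-run noise
    # at this sample size. All numbers below are illustrative.
    import random
    import statistics

    before = [10.02, 10.05, 9.98, 10.11, 10.04, 9.97, 10.08, 10.01]  # s
    after  = [10.00, 10.03, 9.95, 10.09, 10.02, 9.94, 10.06, 9.99]   # s

    observed = statistics.mean(before) - statistics.mean(after)
    pooled = before + after
    n = len(before)

    extreme = 0
    trials = 100_000
    for _ in range(trials):
        random.shuffle(pooled)                 # random relabeling
        diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
        if abs(diff) >= abs(observed):
            extreme += 1

    print(f"observed diff {observed * 1000:.1f} ms, "
          f"p ~ {extreme / trials:.3f}")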
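To make the "same initial conditions" point concrete, here is a minimal sketch of pinning a few of those random variables on Linux (root required; the knobs are standard sysfs/procfs paths, but availability varies by kernel, and the list is nowhere near exhaustive):

    #!/usr/bin/env python3
    # Sketch: pin a few "random variables" before a benchmark run.
    # Linux-only, must run as root. Illustrative, not exhaustive.
    import glob

    def write(path: str, value: str) -> None:
        with open(path, "w") as f:
            f.write(value)

    # Fix the CPU frequency policy to reduce throttling-induced variance.
    for gov in glob.glob(
            "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"):
        write(gov, "performance")

    # Disable ASLR so stack/heap/mmap placement is repeatable.
    write("/proc/sys/kernel/randomize_va_space", "0")

    # Drop the page cache so earlier runs don't leak state into this one.
    write("/proc/sys/vm/drop_caches", "3")

Note the tension with the STABILIZER discussion above, though: pinning ASLR removes variance, but it also locks the measurement to one arbitrary layout - which is exactly the variable a STABILIZER-style tool randomizes on purpose.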