menaerus | 7 months ago
EDIT: Sorry for the wall of text, but I really find this topic quite interesting to discuss.

> where I run benchmarks before-and-after but only look at the single relevant function in the profile. That removes a whole lot of noise

Yes, but IME that approach has always been insufficient. For example, let's say function-level profiling shows that f(x) improved by X% after my change; when I then run the E2E system tests, the results are one of the following:

1. E2E system tests over M different workloads show no difference in performance. The correlation between the change and E2E performance in all M workloads is zero.

2. E2E system tests over M different workloads show that performance improved. The correlation between the change and E2E performance is therefore positive.

3. E2E system tests over M different workloads show that performance degraded. The correlation between the change and E2E performance is negative.

IME the distribution of probabilities over (#1, #2, #3) is roughly [.98, .01, .01].

Hypothesis #1: None of the M workloads is sufficient to show a positive or negative correlation between the change and E2E performance. In other words, we just haven't found that particular M+1st workload yet that would show there really is a change in performance.

Hypothesis #2: There is simply no correlation between the change and E2E performance, exactly as the experiment results show.

Hypothesis #3: Our benchmark measurement is insufficient to catch the change. Resolution might be lacking. Precision might be lacking. Accuracy too.

I find hypothesis #2 the most probable when the experiment results are repeatable (precision). This also means that the "optimization gains" behind the majority of changes we developers make can be easily disproved. E.g. you could have done 10s or 100s of "small optimizations" and yet there is no measurable impact on the E2E runtime performance.

> The biggest hurdle as I see it is really that we don't have something like STABILIZER; if you're measuring a full binary, you're very likely that issues like code moving around cause you to measure completely different things from what you intended, and we have pretty much no way of countering that currently.

I agree, and I see this as the problem of pinning down (hard-coding) all the random variables in our system. Without that we don't have the same initial conditions for each experiment run, and in reality we really don't have them. And pretty much everything is a random variable: the compiler, the linker (two consecutive builds of the same source do not necessarily produce the same binary, e.g. code layout may change), the kernel's state, the filesystem's state, our NVMe drives' state. Then there is the page cache, the I/O scheduler, the task scheduler, NUMA, CPU throttling. So there is a bunch of multidimensional random variables, spread across time, all of which impact the experiment results: a stochastic process by definition.
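To make the repeatability point concrete, here is a minimal sketch of how one could check whether a before/after difference in E2E timings is distinguishable from noise. The timings below are made up for illustration, and the permutation test is just one standard way to do the comparison, not the method anyone in this thread actually uses:

    import random
    import statistics

    def permutation_test(before, after, iterations=10_000):
        # Two-sided permutation test on the difference of means.
        observed = statistics.mean(after) - statistics.mean(before)
        pooled = list(before) + list(after)
        n = len(before)
        extreme = 0
        for _ in range(iterations):
            random.shuffle(pooled)
            diff = statistics.mean(pooled[n:]) - statistics.mean(pooled[:n])
            if abs(diff) >= abs(observed):
                extreme += 1
        return observed, extreme / iterations

    # Hypothetical E2E wall-clock times (seconds), eight runs before and after the change.
    before = [12.31, 12.28, 12.40, 12.35, 12.29, 12.37, 12.33, 12.30]
    after  = [12.30, 12.27, 12.36, 12.34, 12.31, 12.29, 12.32, 12.28]

    delta, p = permutation_test(before, after)
    print(f"mean delta = {delta * 1000:+.1f} ms, p = {p:.3f}")
    # A large p across repeated runs is consistent with hypothesis #2 (no measurable
    # E2E effect); it cannot by itself rule out hypothesis #3 (insufficient resolution).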
Sesse__ | 7 months ago | parent
> E.g. you could have done 10s or 100s of "small optimizations" and yet there is no measurable impact on the E2E runtime performance.

My experience actually diverges here. I've had cases where I've done a bunch of optimizations in the 0.5% range, and then when you benchmark the system against the version from three months ago, you actually see a 20% increase in speed.

Of course, this is on a given benchmark that you have to hope is representative; it's impossible to say exactly how every user's workload will behave in the wild. But if you accept that the goal is to do better on a given E2E benchmark, it absolutely is possible (and again, see SQLite here). You do sometimes have to be able to distinguish between hope and what the numbers are telling you, though; it really sucks when you have an elegant optimization and you just have to throw it in the bin after a week because the numbers just don't agree with you. :-)
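As a rough sanity check on the arithmetic, assuming the individual gains are independent and multiplicative (an assumption, not something measured here), a few dozen 0.5% wins really can compound into a ~20% end-to-end improvement:

    import math

    per_win = 1.005                    # one 0.5% speedup, as a multiplier
    target  = 1.20                     # a 20% overall speedup
    wins_needed = math.log(target) / math.log(per_win)
    print(f"~{wins_needed:.0f} independent 0.5% wins for a 20% gain")  # ~37
    print(f"{per_win ** 40:.3f}x after 40 such wins")                  # ~1.221x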