menaerus 7 months ago

> In your 5 sample example, you can't determine if there are any outliers. You need more samples

I think the same issue is present no matter how many samples we collect. The statistical apparatus of choice may indeed tell us that a given sample is an outlier in our experiment setup, but what I am wondering is whether that sample was actually signal we measured and not noise.

Concrete example: in 10/100 test-runs you see a regression of 10%. The rest of the test-runs show 3%. You can 10x or 100x that example if you wish. Are those 10% regressions outliers because "the environment was noisy", or did our code really run slower under whatever conditions held in those experiments?

> Then using fairly standard nonparametric measures of dispersion and central tendency, a summary statistic should make sense, due to CLT.

In theory yes, and for sufficiently large N (samples). Sometimes you cannot afford to reach this "sufficiently large N" condition.
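
To make that concrete, here is a rough sketch (in Python, with made-up runtimes modeled on the 10/100 example above) of the usual nonparametric summary - median plus MAD - and an outlier rule applied to it. The statistics will happily flag the slow runs; they can't tell you whether those runs were noise or signal:

    import random, statistics

    random.seed(0)
    baseline = 100.0
    # Hypothetical runtimes (seconds): 90 runs around a 3% regression with some
    # measurement noise, plus 10 runs around a 10% regression.
    samples = ([random.gauss(baseline * 1.03, 0.5) for _ in range(90)]
               + [random.gauss(baseline * 1.10, 0.5) for _ in range(10)])

    med = statistics.median(samples)
    mad = statistics.median(abs(x - med) for x in samples)  # median absolute deviation

    # Common rule of thumb: flag anything beyond ~3 scaled MADs from the median.
    flagged = [x for x in samples if abs(x - med) > 3 * 1.4826 * mad]
    print(f"median={med:.1f}s  MAD={mad:.2f}s  flagged={len(flagged)}/{len(samples)} runs")
    # The ~110s runs get flagged as "outliers", but nothing here says whether they
    # were a noisy environment or a genuinely slower code path in those runs.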

Sesse__ 7 months ago | parent [-]

> In theory yes, and for sufficiently large N (samples). Sometimes you cannot afford to reach this "sufficiently large N" condition.

I think at that point, we should get better at saying “OK, we just don't know”. If I can't show within a reasonable resource spend that my optimization is worthwhile, then perhaps I shouldn't add it. (Of course, it depends on how much it uglifies the code, whether I have some external reason to believe it's better, and so on. But in general, people tend to overestimate how much anything will help :-) )

menaerus 7 months ago | parent [-]

I agree and I am totally happy to say "I tried to measure but the result I found is inconclusive" or "I believe that this is at worst a neutral commit - e.g. it won't add any regression". Having spent probably thousands of hours e2e benchmarking the code I wrote, I'm always skeptical about benchmarking frameworks, blogs, etc.

The latest example being the paper from Meta, where they claim that they can detect 0.005% regressions. I really don't think this is possible in sufficiently complex e2e system tests. IME it is extremely challenging to detect regressions below 5% with high confidence.

Link: https://tangchq74.github.io/FBDetect-SOSP24.pdf
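
As a back-of-the-envelope illustration (numbers entirely made up): a plain two-sample permutation test on synthetic end-to-end runtimes with ~2% run-to-run noise will usually resolve a 5% regression but not a 0.5% one at realistic run counts:

    import random, statistics

    def p_value(before, after, n_perm=2000):
        """Two-sided permutation test on the difference of means."""
        observed = statistics.mean(after) - statistics.mean(before)
        pooled = before + after
        hits = 0
        for _ in range(n_perm):
            random.shuffle(pooled)
            diff = statistics.mean(pooled[len(before):]) - statistics.mean(pooled[:len(before)])
            if abs(diff) >= abs(observed):
                hits += 1
        return hits / n_perm

    random.seed(1)
    baseline, noise, runs = 100.0, 2.0, 30            # ~2% run-to-run noise, 30 runs per side
    before = [random.gauss(baseline, noise) for _ in range(runs)]
    for regression in (0.05, 0.005):                  # a 5% and a 0.5% regression
        after = [random.gauss(baseline * (1 + regression), noise) for _ in range(runs)]
        print(f"{regression:.1%} regression -> p = {p_value(before, after):.3f}")
    # Typically the 5% case is clearly significant while the 0.5% case is not -
    # you simply cannot afford the run count needed to resolve it.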

Sesse__ 7 months ago | parent [-]

It really depends on your benchmark and how much bias you're willing to trade for your variance. I mean, SQLite famously uses Callgrind and claims to be able to measure 0.005%, which they definitely can, but only on the CPU that Callgrind simulates, which may or may not coincide with reality. Likewise, I've used strategies similar to the one Meta describes, where I run benchmarks before-and-after but only look at the single relevant function in the profile. That removes a whole lot of noise (I've reliably found -- and later verified by other means -- 0.2% wins in large systems), but won't catch cases like e.g. large-scale code bloat.
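
A minimal sketch of that single-function comparison, assuming per-function self-time fractions have already been extracted from the two profile runs (the data and function names below are hypothetical):

    # Hypothetical per-function self-time fractions from two profile runs
    # (e.g. aggregated sampling-profiler output); only the function under test is
    # compared, so unrelated movement elsewhere in the profile is ignored.
    profile_before = {"Filter::Eval": 0.0412, "Scan::Next": 0.2210, "other": 0.7378}
    profile_after  = {"Filter::Eval": 0.0391, "Scan::Next": 0.2215, "other": 0.7394}

    def relative_change(before, after, func):
        """Relative change in the self-time share of a single function."""
        return (after[func] - before[func]) / before[func]

    change = relative_change(profile_before, profile_after, "Filter::Eval")
    print(f"Filter::Eval self-time changed by {change:+.1%}")
    # About -5% on a function that is ~4% of the profile, i.e. roughly a 0.2%
    # end-to-end win - visible here, but easily invisible in whole-binary wall time.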

The biggest hurdle as I see it is really that we don't have something like STABILIZER; if you're measuring a full binary, it's very likely that issues like code moving around cause you to measure completely different things from what you intended, and we have pretty much no way of countering that currently. And the more locked you are to a single code path (i.e., your histogram is very skewed), the worse these issues are.

menaerus 7 months ago | parent | prev [-]

EDIT: sorry for the wall of text but I really find this topic to be quite interesting to discuss.

> where I run benchmarks before-and-after but only look at the single relevant function in the profile. That removes a whole lot of noise

Yes, but in my experience that approach has always been insufficient.

For example, let's say that function profiling data shows that f(x) improved by X% after my change; however, when I run the E2E system tests, the results I get are one of the following:

1. E2E system tests over M different workloads show no difference in performance. The correlation between the change and E2E performance in all M workloads is zero.

2. E2E system tests over M different workloads show that performance improved. The correlation between the change and E2E performance is therefore positive.

3. E2E system tests over M different workloads show that performance degraded. The correlation between the change and E2E performance is negative.

IME the distribution of probabilities across (#1, #2, #3) is roughly [.98, .01, .01].

Hypothesis #1: None of the M workloads were sufficient to show that there is a positive or negative correlation between the change and E2E performance. In other words, we haven't yet found the (M+1)th workload that would show there really is a change in performance.

Hypothesis #2: There is simply no correlation between the change and E2E performance as experiment results have shown.

Hypothesis #3: Our benchmark measurement is insufficient to catch the change - resolution, precision or accuracy might be lacking.

I find hypothesis #2 to be the most probable when experiment results are repeatable (precision).

This also means that the majority of changes we developers make for the sake of "optimization gains" can easily be disproved. E.g. you could have done 10s or 100s of "small optimizations" and yet there is no measurable impact on the E2E runtime performance.
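
A rough run-count sketch (normal approximation; the 2% noise figure and the effect sizes are assumptions) of why those small wins sit below the benchmark's resolving power - the runs needed grow with the square of the noise-to-effect ratio:

    # Rough two-sample run-count estimate (normal approximation, 5% alpha, 80%
    # power); the 2% noise figure and the effect sizes below are assumptions.
    def runs_per_side(sigma, effect, z_alpha=1.96, z_beta=0.84):
        return 2 * ((z_alpha + z_beta) * sigma / effect) ** 2

    sigma = 2.0                        # run-to-run noise, in percent of runtime
    for effect in (5.0, 1.0, 0.5):     # regression size, in percent of runtime
        print(f"{effect:.1f}% effect -> ~{runs_per_side(sigma, effect):.0f} runs per side")
    # ~3 runs for a 5% effect, ~63 for 1%, ~251 for 0.5% - and that assumes
    # well-behaved, independent noise, which e2e benchmarks rarely provide.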

> The biggest hurdle as I see it is really that we don't have something like STABILIZER; if you're measuring a full binary, it's very likely that issues like code moving around cause you to measure completely different things from what you intended, and we have pretty much no way of countering that currently.

I agree, and I see this as the problem of hard-coding (pinning) all the random variables in our system. Otherwise we don't have the same initial conditions for each experiment run - and in reality we really don't.

And pretty much everything is a random variable. The compiler. The linker. Two consecutive builds from the same source do not necessarily produce the same binary, e.g. the code layout may change. The kernel has state. The filesystem has state. Our NVMe drives have state. Then there is the page cache. The I/O scheduler. The task scheduler. NUMA. CPU throttling.

So there's a bunch of multidimensional random variables spread across time, all of which impact the experiment results - a stochastic process by definition.
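
For what it's worth, a small pre-flight sketch of checking a few of those knobs on Linux (the paths depend on the kernel and CPU driver, so treat this as an illustration, not a recipe):

    import pathlib

    # A pre-flight check of a few knobs that commonly add run-to-run variance.
    # Linux-specific paths; availability depends on the kernel and CPU driver.
    knobs = {
        "cpu0 governor": "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
        "ASLR":          "/proc/sys/kernel/randomize_va_space",
        "turbo off":     "/sys/devices/system/cpu/intel_pstate/no_turbo",
    }
    for name, path in knobs.items():
        p = pathlib.Path(path)
        print(f"{name:14s} {p.read_text().strip() if p.exists() else 'n/a'}")
    # Even with all of these pinned, code layout, page cache and NVMe state still
    # differ between runs - you reduce the variance, you don't eliminate it.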

Sesse__ 7 months ago | parent [-]

> E.g. you could have done 10s or 100s of "small optimizations" and yet there is no measurable impact on the E2E runtime performance.

My experience actually diverges here. I've had cases where I've done a bunch of optimizations in the 0.5% range, and then when you go and benchmark the system against the version that was three months ago, you actually see a 20% increase in speed.

Of course, this is on a given benchmark which you have to hope is representative; it's impossible to say exactly how every user is going to behave in the wild. But if you accept that the goal is to do better on a given E2E benchmark, it absolutely is possible (and again, see SQLite here). But you sometimes have to be able to distinguish between hope and what the numbers are telling you; it really sucks when you have an elegant optimization and you just have to throw it in the bin after a week because the numbers just don't agree with you. :-)

menaerus 7 months ago | parent [-]

> My experience actually diverges here. I've had cases where I've done a bunch of optimizations in the 0.5% range, and then when you go and benchmark the system against the version that was three months ago, you actually see a 20% increase in speed.

Yeah, not IME really. First, I don't know how to reliably measure at 0.5% resolution. Second, this would imply that YoY we should be able to see [~20, ~20+x]% runtime performance improvement in the software we are working on, and this doesn't resemble my experience at all - it's usually the other way around, and it's mostly about "how to add a new feature without making the rest of this ~5 MLoC software regress". Big optimization wins were quite rare.

Amdahl's law says that the "overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used", so throwing a bunch of optimizations at the code does not, in most cases, result in an overall improvement. I can replace my notorious use of std::unordered_map with absl::flat_hash_map and still see no improvement at all.
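
A quick sketch of that arithmetic (the fraction p and the speedup s below are assumed numbers): if the map lookups are 5% of runtime and the new hash map halves their cost, the end-to-end win is ~2.5% - easily below the noise floor discussed above:

    # Amdahl's law: overall speedup when a fraction p of the runtime is sped up
    # by a factor s. The p = 5% and s = 2x figures below are assumptions.
    def amdahl_speedup(p, s):
        return 1.0 / ((1.0 - p) + p / s)

    overall = amdahl_speedup(p=0.05, s=2.0)
    print(f"overall speedup: {overall:.4f}x")            # ~1.0256x
    print(f"runtime reduction: {1 - 1 / overall:.2%}")   # ~2.5%, easily lost in noise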

> it really sucks when you have an elegant optimization and you just have to throw it in the bin after a week because the numbers just don't agree with you. :-)

It really does, and I've been there many times. However, I learned to frame this as "I have code that I thought would improve our runtime, but I found no signal to support my theory". This automatically makes such changes difficult to merge, especially considering that most optimizations aren't exactly "clean code" practice.

Sesse__ 7 months ago | parent [-]

I see Amdahl's Law as an opportunity, not a limit. :-) If you optimize something, it means the remainder is now even more valuable to optimize, percentage-wise. In a way, it's like compound interest.
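
A tiny numeric sketch of the compounding (the 0.5% figure and the count of forty are assumptions): independent 0.5% runtime reductions multiply rather than add, and forty of them come out to roughly a 20% overall speedup:

    # Forty hypothetical, independent 0.5% runtime reductions, compounded.
    runtime = 1.0
    for _ in range(40):
        runtime *= 1.0 - 0.005

    print(f"remaining runtime: {runtime:.3f}")      # ~0.818 of the original
    print(f"overall speedup:  {1 / runtime:.3f}x")  # ~1.22x, i.e. roughly 20% faster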