You also want to hit him with queueing theory.

Up to a hair over 60% utilization, the queuing delays on any work queue remain essentially negligible. At 70% they become noticeable, and at 80% they've doubled. And then it just turns into a shitshow from there on.

The rule of thumb is that 60% is effectively zero, and 80% is the knee where delays start to blow up.

On the biggest cluster I ran, we hit about 65% CPU at our target p95 time, which is pretty much right on the theoretical mark.
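For anyone who wants to see the shape of that curve, here is a minimal sketch. It assumes an M/M/c queue (Poisson arrivals, exponential service times, `servers` identical workers sharing one queue), which is a textbook simplification and not something the comment specifies. The function names are made up for illustration. The exact thresholds depend on the model and on the number of servers: a single server is already queuing noticeably at 60%, while a larger pool stays flat for longer and then turns more sharply.

```python
from math import factorial


def erlang_c(servers: int, utilization: float) -> float:
    """Probability that an arriving job has to wait (Erlang C formula).

    Assumption: M/M/c queue, chosen only to illustrate the curve.
    """
    if servers < 1 or not 0.0 <= utilization < 1.0:
        raise ValueError("need servers >= 1 and 0 <= utilization < 1")
    offered = servers * utilization  # offered load in Erlangs
    # Term for "all servers busy", scaled by 1 / (1 - utilization).
    tail = offered**servers / factorial(servers) / (1.0 - utilization)
    # Terms for "fewer than `servers` jobs in the system".
    head = sum(offered**k / factorial(k) for k in range(servers))
    return tail / (head + tail)


def mean_wait_factor(servers: int, utilization: float) -> float:
    """Mean time spent waiting in queue, as a multiple of mean service time."""
    return erlang_c(servers, utilization) / (servers * (1.0 - utilization))


if __name__ == "__main__":
    for servers in (1, 4, 16):
        for rho in (0.60, 0.70, 0.80, 0.90, 0.95):
            factor = mean_wait_factor(servers, rho)
            print(f"servers={servers:2d} util={rho:.0%} wait={factor:.3f} x service time")
```

With `servers=1` this reduces to the familiar utilization / (1 - utilization) curve.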

BrendanLong 6 days ago

A big part of this is that CPU utilization metrics are frequently averaged over a long period of time (like a minute), but if your SLO is 100 ms, what you care about is whether there's any ~100 ms period where CPU utilization is at 100%. Measuring p99 (or even p100) CPU utilization can make this a lot more visible.
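A rough sketch of why the averaging window matters, assuming you already have utilization samples at fine granularity (one value per 100 ms window here). The sample data is invented for illustration and `summarize` is a hypothetical helper, not part of any monitoring tool.

```python
def summarize(samples: list[float]) -> dict[str, float]:
    """Mean, nearest-rank p99, and max of per-window utilization samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank p99, computed with integer math: ceil(0.99 * n).
    rank = (99 * len(ordered) + 99) // 100
    return {
        "mean": sum(ordered) / len(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Made-up minute of 100 ms windows: 590 windows at 50% busy and
    # a one-second burst (10 windows) where the CPU is pegged.
    minute = [0.50] * 590 + [1.00] * 10
    stats = summarize(minute)
    print(f"mean={stats['mean']:.3f} p99={stats['p99']:.2f} max={stats['max']:.2f}")
```

The one-minute mean barely moves during the burst, while p99 and max over the short windows show the saturation directly.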

hinkley 6 days ago

The vertical for this company was one where the daily traffic was oddly regular. That the two lines matched expectations likely has to do with the smoothness of the load.

The biggest problem was not variance in request rate; it was variance in request cost, which is usually where queuing kicks in, unless you're being dumb about things. I think for a lot of apps p98 is probably a better metric to chase. p99 and p100 are useful for understanding your application better, but I'm not sure you want your bosses to fixate on them.

But our contracts were for p95, which was fortunate given the workload, or at least whoever made the contracts got good advice from the engineering team.
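The point above about variance in request cost can be made concrete with the Pollaczek-Khinchine formula for mean wait. This is only a sketch: it assumes an M/G/1 queue (Poisson arrivals, one server, arbitrary service-time distribution), which is not something the comment claims about the actual system, and the request costs are invented numbers with the same mean.

```python
from statistics import fmean, pvariance


def mg1_wait_factor(utilization: float, service_times: list[float]) -> float:
    """Mean queueing delay as a multiple of mean service time.

    Assumption: M/G/1 queue. Pollaczek-Khinchine gives
    Wq / S = rho / (1 - rho) * (1 + cv^2) / 2,
    where cv is the coefficient of variation of the service time.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    mean = fmean(service_times)
    cv_squared = pvariance(service_times, mu=mean) / mean**2
    return utilization / (1.0 - utilization) * (1.0 + cv_squared) / 2.0


if __name__ == "__main__":
    # Hypothetical workloads, both averaging 10 ms per request.
    uniform_cost = [10.0] * 100
    mixed_cost = [5.0] * 90 + [55.0] * 10  # a few expensive requests
    for rho in (0.60, 0.65, 0.80):
        flat = mg1_wait_factor(rho, uniform_cost)
        spiky = mg1_wait_factor(rho, mixed_cost)
        print(f"util={rho:.0%} uniform={flat:.2f} mixed={spiky:.2f} (x service time)")
```

Same arrival rate and same mean cost, but the workload with a few expensive requests waits several times longer at every utilization level.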

kccqzy 6 days ago

If your SLO is 100 ms you need far more granular measurement periods than that. You should measure the p99 or p100 utilization for every 5-ms interval or so.

Ambroisie 6 days ago

Do you have a link to a more in-depth analysis of the queuing theory for these numbers?

hinkley 6 days ago

I can picture charts from various treatments in my head but none of the names stick. I really should have a favorite couple of links or books but unfortunately I do not. I will put that on my todo list. The magic search terms are “queue size/length”, “utilization”.