A big part of this is that CPU utilization metrics are frequently averaged over a long period of time (like a minute), but if your SLO is 100 ms, what you care about is whether there's any ~100 ms period where CPU utilization is at 100%. Measuring p99 (or even p100) CPU utilization can make this a lot more visible.
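
To make the averaging point concrete, here is a minimal sketch in Python. The trace is synthetic and the helper names (`window_utilization`, `percentile`) are made up for illustration; a real setup would read busy/idle samples from whatever the monitoring agent exposes.

```python
import math

def window_utilization(busy_ms, window_ms):
    """Utilization (0.0 to 1.0) of each consecutive, non-overlapping window.

    busy_ms holds one entry per millisecond: 1 if the CPU was busy, 0 if idle.
    """
    return [
        sum(busy_ms[i:i + window_ms]) / window_ms
        for i in range(0, len(busy_ms) - window_ms + 1, window_ms)
    ]

def percentile(values, p):
    """Nearest-rank percentile, with p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic one-minute trace (an assumption, not real data):
# a 30% busy background plus ten fully saturated 100 ms bursts.
trace = [1 if t % 10 < 3 else 0 for t in range(60_000)]
for start in range(0, 60_000, 6_000):
    for t in range(start, start + 100):
        trace[t] = 1

minute_avg = sum(trace) / len(trace)
windows = window_utilization(trace, 100)

print("1-minute average:", round(minute_avg, 3))
print("p50 of 100 ms windows:", percentile(windows, 50))
print("p99 of 100 ms windows:", percentile(windows, 99))
print("p100 of 100 ms windows:", percentile(windows, 100))
```

On this trace the one-minute average comes out around 31%, while the p99 and p100 of the 100 ms windows are both 100%. Shrinking `window_ms` gives the finer-grained view at the cost of more samples.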

hinkley 6 days ago

The vertical for this company was one where the daily traffic was oddly regular. That the two lines matched expectations likely has to do with the smoothness of the load.

The biggest problem was not variance in request rate but variance in request cost, which is usually where queuing kicks in, unless you're being dumb about things. I think for a lot of apps p98 is probably a better metric to chase. p99 and p100 are useful for understanding your application better, but I'm not sure you want your bosses to fixate on them.

But our contracts were for p95, which was fortunate given the workload, or at least whoever made the contracts got good advice from the engineering team.
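
The request-cost point can be illustrated with a toy single-worker FIFO queue. This is only a sketch under made-up numbers (one request every 10 ms, mean cost 8 ms), not the workload described above; `queue_waits` is a hypothetical helper that applies the standard Lindley recursion.

```python
def queue_waits(costs_ms, interarrival_ms):
    """Time each request spends queued on a single FIFO worker.

    Lindley recursion: the next wait is the current wait plus the current
    request's cost, minus the gap until the next arrival, floored at zero.
    """
    waits = []
    wait = 0
    for cost in costs_ms:
        waits.append(wait)
        wait = max(0, wait + cost - interarrival_ms)
    return waits

INTERARRIVAL_MS = 10  # perfectly regular request rate in both scenarios

# Same request count and same total cost (80% utilization) in both cases.
uniform_costs = [8] * 1000
spiky_costs = ([4] * 9 + [44]) * 100  # one expensive request in every ten

uniform_waits = queue_waits(uniform_costs, INTERARRIVAL_MS)
spiky_waits = queue_waits(spiky_costs, INTERARRIVAL_MS)

print("total cost:", sum(uniform_costs), sum(spiky_costs))
print("max wait, uniform cost:", max(uniform_waits), "ms")
print("max wait, spiky cost:", max(spiky_waits), "ms")
```

With identical arrival rate and average utilization, the uniform-cost run never queues, while the spiky-cost run makes the requests behind each expensive one wait up to 34 ms.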

kccqzy 6 days ago

If your SLO is 100 ms you need far more granular measurement periods than that. You should measure the p99 or p100 utilization for every 5-ms interval or so.
