kqr 6 days ago

It might be a lie, but it surely is a practical one. In my brief foray into site reliability engineering I used CPU utilisation (of CPU-bound tasks) together with queueing theory to choose how to scale servers before big events.
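
As a rough sketch of how that worked (the rates below are made-up numbers, and M/M/c with exponential service times is an assumption about the traffic, not a description of the real system):

    from math import factorial

    def erlang_c(servers: int, offered_load: float) -> float:
        """Probability that an arriving request has to queue in an M/M/c system."""
        if offered_load >= servers:
            return 1.0  # overloaded: the queue grows without bound
        top = (offered_load ** servers / factorial(servers)) * (servers / (servers - offered_load))
        return top / (sum(offered_load ** k / factorial(k) for k in range(servers)) + top)

    def servers_needed(arrival_rate: float, service_rate: float, max_wait: float):
        """Smallest server count that keeps the mean queueing delay under max_wait."""
        offered = arrival_rate / service_rate      # load in Erlangs
        c = int(offered) + 1                       # need c > offered load for stability
        while True:
            wait = erlang_c(c, offered) / (c * service_rate - arrival_rate)
            if wait <= max_wait:
                return c, offered / c              # server count and resulting utilisation
            c += 1

    # e.g. expecting 800 req/s, one server handles ~100 req/s, keep mean queueing delay under 5 ms
    print(servers_needed(800.0, 100.0, 0.005))

The exact model matters less than the fact that a crude utilisation number plus a queueing formula already answers "how many servers before the event" better than folklore does.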

The %CPU suggestions ran contrary to (and were much more conservative than) the "old wisdom" that would otherwise have been used. It worked out great, and at much lower cost than the old approach would have incurred.

What I'm trying to say is you shouldn't be afraid of using semi-crappy indicators just because they're semi-crappy. If it's the best you've got, it might be good enough anyway.

In the case of CPU utilisation, though, the number in production shouldn't go above 40 % for many reasons. At 40 % there's usually still a little headroom. The author's mistake was not using the fundamentals of queueing theory to avoid high utilisation!
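
For intuition, under a single-queue M/M/1-style approximation (my simplification) the mean response time scales as 1/(1 - utilisation), which is gentle up to around 40-50 % and then explodes:

    # M/M/1 approximation: mean response time = service_time / (1 - utilisation)
    for rho in (0.2, 0.4, 0.6, 0.8, 0.9, 0.95):
        print(f"utilisation {rho:.0%}: response time x{1 / (1 - rho):.1f}")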

therealdrag0 6 days ago | parent | next [-]

> semi-crappy indicator … good enough.

Agree. Another example of this is metrics reported as percentiles per host that you then average, vs. histograms per host with the percentile computed at aggregation time across hosts. Sure, an avg/max of a percentile is technically not a percentile, but in practice switching between one and the other hasn't affected my operations at all. Yet I know some people are adamant about mathematical correctness, as if that translated directly to operations.

arccy 6 days ago | parent [-]

That works ok when you have evenly distributed load (which you want / would hope to have), much less so when your workload is highly unbalanced.
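
A quick sketch of where it stops being harmless (the per-host latency distributions are made up; the point is the imbalance):

    import random
    from statistics import quantiles

    random.seed(0)
    # one lightly loaded fast host, one heavily loaded slow host (hypothetical latencies in ms)
    fast_host = [random.gauss(20, 5) for _ in range(1_000)]
    slow_host = [random.gauss(80, 20) for _ in range(9_000)]

    def p99(samples):
        return quantiles(samples, n=100)[98]

    avg_of_p99 = (p99(fast_host) + p99(slow_host)) / 2   # average of per-host percentiles
    merged_p99 = p99(fast_host + slow_host)              # percentile computed over the merged data
    print(f"avg of p99s: {avg_of_p99:.0f} ms, p99 of merged data: {merged_p99:.0f} ms")

With balanced traffic the two numbers land close together, which is why the shortcut usually works; the unweighted average drifts once one host sees most of the requests or has a very different tail.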

mayama 6 days ago | parent | prev | next [-]

A combination of CPU% and loadavg generally tells you how a system is doing. I've had systems where loadavg was high, waiting on network/IO, but CPU% was low. Tracing high load isn't always as straightforward as CPU%, though: you have to go through io%, net%, syscalls, etc.
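
A minimal Linux-only sketch of looking at the two side by side (it counts iowait as idle, which is exactly why an IO-bound box shows high loadavg with low CPU%):

    import os, time

    def cpu_busy_fraction(interval: float = 1.0) -> float:
        """Whole-machine CPU busy fraction from two /proc/stat samples."""
        def snapshot():
            with open("/proc/stat") as f:
                fields = [float(x) for x in f.readline().split()[1:]]
            idle = fields[3] + fields[4]   # idle + iowait
            return idle, sum(fields)
        idle1, total1 = snapshot()
        time.sleep(interval)
        idle2, total2 = snapshot()
        return 1.0 - (idle2 - idle1) / (total2 - total1)

    load1, _, _ = os.getloadavg()
    print(f"cpu busy {cpu_busy_fraction():.0%}, loadavg {load1:.1f} on {os.cpu_count()} CPUs")
    # loadavg well above the number of busy CPUs usually means tasks blocked on disk/network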

saagarjha 6 days ago | parent | prev | next [-]

40% seems quite lightly utilized tbh

cpncrunch 6 days ago | parent | next [-]

I tend to use 50% as a soft target, which seems like a good compromise. Sometimes it may go a little bit over that, but if it's occasional it shouldn't be an issue.

It's not good to go much over 50% on a server (assuming half the CPUs are just hyperthreads), because you're essentially relying on your load being able to share the actual CPU cores. At some point, when the load increases too much, there may not be any headroom left for sharing those physical CPUs. You then get to the point where adding a little more load at 80% suddenly results in 95% utilization.
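
A toy model of that cliff (the 30% SMT yield below is a made-up ballpark, not a measurement, and real schedulers are messier):

    PHYSICAL_CORES = 16
    SMT_YIELD = 0.3  # assumed extra throughput from the second hyperthread on an already-busy core

    def real_capacity_used(reported_util: float) -> float:
        """Map a 'top'-style utilization over 2x logical CPUs to the fraction of real
        capacity used, assuming the scheduler fills empty physical cores before
        doubling up siblings."""
        busy_threads = reported_util * 2 * PHYSICAL_CORES
        full_capacity = PHYSICAL_CORES * (1 + SMT_YIELD)
        if busy_threads <= PHYSICAL_CORES:
            work = busy_threads        # every busy thread still has a physical core to itself
        else:
            work = PHYSICAL_CORES + (busy_threads - PHYSICAL_CORES) * SMT_YIELD
        return work / full_capacity

    for util in (0.3, 0.5, 0.6, 0.8):
        print(f"reported {util:.0%} -> ~{real_capacity_used(util):.0%} of real capacity")

In this model a reported 50% already consumes roughly three quarters of the machine's real capacity, which is where the "a little more load, a big jump in utilization" effect comes from.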

kqr 6 days ago | parent | prev | next [-]

It depends on how variable the load is compared to how fast the servers can scale up and down, etc. My rule of thumb is to keep enough headroom to handle twice the load while staying within triple the response time. You can solve the equations for your specific case, but eyeballing graphs such as [1], I end up somewhere in the area of 40 %.
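
For a single M/M/1-style server (my simplification; the linked graph may be based on a different model) that rule of thumb can be solved exactly: response time scales as 1/(1 - u) where u is utilisation, so asking that 1/(1 - 2u) <= 3/(1 - u) gives 1 - u <= 3 - 6u, i.e. u <= 0.4.

    # M/M/1 sketch: how much slower do responses get if load doubles from utilisation rho?
    for rho in (0.30, 0.40, 0.45):
        slowdown = (1 - rho) / (1 - 2 * rho)   # ratio of response times at 2x load vs 1x load
        print(f"run at {rho:.0%}: doubling the load multiplies response time by {slowdown:.2f}")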

The important part is of course to ask yourself the question "how much increased load may I need to handle, and how much can I degrade system performance in doing so?" You may work in an industry that only ever sees 10 % additional load at timescales where scaling is unfeasible, and then you can pick a significantly higher normal utilisation level. Or maybe you're in an industry where you cannot degrade performance by more than 10 % even if hit by five times the load – then you need a much, much more conservative target for utilisation.

[1]: https://erikbern.com/assets/wait-time-2.png

paravz 5 days ago | parent | prev [-]

CPU utilization % needs to be contrasted with a "business" metric like latency or RPS. Depending on the environment and hardware, 40% can be over-utilized or way underutilized.

zekrioca 6 days ago | parent | prev [-]

I noticed exactly the same thing. The author is saying something that has been written in queueing theory books for decades, yet they are only noticing it now.