perching_aix 5 hours ago
I think you got it, but let me lay it out more explicitly with a specific example.

I visit HN, that's one request. But I visit HN multiple times a day. So for the operation that serves the homepage, if you took e.g. a past-24hr latency p99 chart, the number of requests analyzed would not be the same as the number of unique users who made those requests, potentially drastically so. So you might see a p99 you're comfortable with and conclude that since only 1% of requests were worse than that, it's fine. In practice though, depending on how "well-trodden" that operation is, you might very well be in a situation where all users experienced at least one such beyond-SLO event that day, since the mapping of requests to users is many-to-one.

The cross-operation version of this is important as well, yes. Users can experience snags across common flows too, for example; same idea.

Regarding methodology, it's nothing special, I just rely on user IDs and correlation IDs. It really is just a perspective shift, the underlying data is the same. You can even back-calculate the "number of nines needed to get an acceptable UX" using this, as long as the general usage habits are stable. It's just gonna be a lot more nines than two in my experience.
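
To put rough numbers on the request-versus-user point, here is a minimal Python sketch. The function names and sample data are hypothetical, and the two formula-based helpers assume slow requests land independently and that every user makes the same number of requests per day. Real traffic tends to cluster, so treat the output as a back-of-the-envelope estimate, not the commenter's actual tooling.

```python
import math


def users_affected(requests, slo_ms):
    """Share of users with at least one request slower than slo_ms.

    `requests` is an iterable of (user_id, latency_ms) pairs; this record
    shape is an assumption made for illustration.
    """
    seen, hit = set(), set()
    for user_id, latency_ms in requests:
        seen.add(user_id)
        if latency_ms > slo_ms:
            hit.add(user_id)
    return len(hit) / len(seen)


def share_of_users_hit(p_slow, requests_per_user):
    """Estimated share of users who see at least one slow request.

    Assumes independent requests: 1 - (1 - p)^n.
    """
    return 1.0 - (1.0 - p_slow) ** requests_per_user


def nines_needed(requests_per_user, acceptable_share):
    """Per-request 'nines' so that at most acceptable_share of users are hit.

    Solves 1 - (1 - p)^n <= s for p, then converts p to a count of nines.
    """
    p_slow = 1.0 - (1.0 - acceptable_share) ** (1.0 / requests_per_user)
    return -math.log10(p_slow)


# Three users, four requests each, one slow request per user:
# 25% of requests breach the SLO, yet every user was affected.
sample = [(user, latency)
          for user in ("a", "b", "c")
          for latency in (100, 120, 110, 900)]
print(users_affected(sample, slo_ms=500))
print(share_of_users_hit(0.01, 100))
print(nines_needed(100, 0.01))
```

Under these assumptions, a 1% slow-request rate at 100 requests per user per day leaves roughly 63% of users with at least one slow request, and keeping that share to 1% of users takes about four nines per request.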