armanj 3 hours ago

How reliable is this uptime? and why it's sooo different from gh's official status numbers?

xnorswap 3 hours ago | parent | next [-]

Their headline figure is a bit exaggerated. It's derived from the official status numbers, but aggregated across all GH services.

Imagine you run 365 services, and each goes down 1 day a year.

If those are all on the same day, this would report you having 99.7% uptime.

If instead, each service goes down 1 day per year but on different days, this would report you having 0% uptime.

Despite the actual downtime being the same for any given service.
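The two extremes above can be sketched with a few lines of Python (a toy model, not the site's actual methodology), where a day counts as "down" if any of the 365 services is down that day:

```python
DAYS = 365
SERVICES = 365

def aggregate_uptime(down_days_per_service):
    """Fraction of days on which NO service was down."""
    down_days = set()
    for days in down_days_per_service:
        down_days.update(days)
    return 1 - len(down_days) / DAYS

# Scenario 1: every service is down on the same day.
same_day = [{0} for _ in range(SERVICES)]
# Scenario 2: each service is down on a different day.
different_days = [{i} for i in range(SERVICES)]

print(f"{aggregate_uptime(same_day):.1%}")        # 99.7%
print(f"{aggregate_uptime(different_days):.1%}")  # 0.0%
```

Per-service uptime is identical in both scenarios; only the overlap of the outages changes the aggregate number.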

The truth is somewhere in the middle: GitHub has run degraded for a significant amount of time.

But I don't think it is fair to take an incident like this one[1], where 5% of requests were incorrectly denied authorisation, and count it the same as you would the whole of github being down.

[1] https://www.githubstatus.com/incidents/02z04m335tvv

dijit 29 minutes ago | parent [-]

yeah, it's a hard problem to give people an accurate reliability number.

Rachel famously wrote about this in "Your nines are not my nines"[0].

The truth is, though, that some systems depend on others. Actions being down means you can't merge code or release; but, you know, git operations being unavailable has the same effect. It's meaningless to separate the two.

So it depends on the framing.

[0]: https://rachelbythebay.com/w/2019/07/15/giant/

dspillett 3 hours ago | parent | prev | next [-]

> How reliable is this uptime?

It seems to be quoting incident reports for the duration of each outage, so there is accountability: you can verify the details of what they are counting.

> and why it's sooo different from gh's official status numbers?

Maybe this is counting any period with any service showing any level of issue as a complete fail, while the official numbers are cherry-picking a bit (only counting core services? not counting significant performance issues that the other count does, because things were working, just v…e…r…y … s…l…o…w…l…y?) or averaging values (so 75% of services running at a given time looks ¼ as bad in their figures). Or the two sets of calculations could simply be done at a different granularity, …

In other words: lies, damned lies, and statistics!

The only way to know is to know how both are calculated in detail, and that information might not be readily available.
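To make the "different calculations, different numbers" point concrete, here is a toy sketch (the incident figures are invented, and neither method is claimed to be what GitHub or the site actually uses) scoring the same incident log two ways:

```python
# Each incident: (duration in hours, fraction of services affected).
incidents = [(2, 0.05), (1, 0.50), (4, 0.25)]
HOURS_IN_YEAR = 365 * 24

# Method A: any incident counts as total downtime for its full duration.
down_a = sum(hours for hours, _ in incidents)
# Method B: downtime weighted by the fraction of services affected.
down_b = sum(hours * frac for hours, frac in incidents)

print(f"Method A uptime: {1 - down_a / HOURS_IN_YEAR:.3%}")
print(f"Method B uptime: {1 - down_b / HOURS_IN_YEAR:.3%}")
```

Same incidents, but method A charges 7 hours of downtime while method B charges only 1.6, so the resulting "nines" differ without either side lying about any individual incident.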

fridder 3 hours ago | parent | prev | next [-]

There is a link to the repo, where you can verify the code and read about their process

datadrivenangel 3 hours ago | parent | prev [-]

1. This one counts downtime from any service: if anything is down or degraded, they count it as 100% down, which is harsh.

2. GitHub is doing some classic big-org sneaky things where they don't count degraded service fully. So if GitHub Actions is partially down for most people, in a way that makes you say "github is down", there's a good chance that Microsoft doesn't count that, or counts it only partially.

xvilka 3 hours ago | parent [-]

> Github is doing some classic big org sneaky things where they don't count degraded service fully.

An even worse example is Travis CI. For more than a year, their CI jobs have sometimes gotten stuck or not started for days, and, surprise-surprise, it's never shown on their status page[1] - always green. We would switch to something else entirely if not for their unique offering of PowerPC and SystemZ servers/runners. Apart from that, it's the worst CI service I've used so far.

[1] https://www.traviscistatus.com/history