Aurornis | 7 hours ago

I've written before on HN about when my employer hired several ex-FAANG people to manage everything cloud-related at our company. Whenever there was an outage, they would fight anyone who wanted to update the status page to show it; they had no shortage of excuses and reasons not to. Eventually we figured out that they were planning to use the uptime figures to request raises and promotions, as they had at their FAANG employer, so anything that reduced that uptime number was to be avoided at all costs.
cj | 7 hours ago

Are there companies that actually use their status page as the source of truth for uptime numbers? I think it's far more common for companies to have a public status page plus internal tooling that tracks the "real" uptime number (e.g. Datadog monitors, New Relic monitoring, etc.). Your point still stands, though.
Aurornis | 7 hours ago

I don't know, but I will say that this team was so hyperfocused on any number they planned to use for performance reviews that it probably wouldn't have mattered which service measured the website's performance; they would have found a way to game it. If we had used the internal devops observability tools, I bet they would have started pulling back logging and reducing the severity levels reported in the codebase.

It's obviously not a problem at every company, because many companies will recognize these shenanigans and come down hard on them. But you could tell these guys would seize any opportunity to game a number they thought would come up at performance review time. Ironically, our CEO didn't even look at those numbers. He used the site and remembered the recent outages.
darccio | 7 hours ago

[Datadog employee here] https://updog.ai tracks the uptime of multiple services by real impact across Datadog customers.
|
|
|
mvkel | 6 hours ago

It's because if you automate it, something could (and eventually would) happen to the little script that defines "uptime," and if that script goes down, suddenly you're in violation of your SLA and all of your customers start demanding refunds/credits while everything is actually running fine. Or say your load balancer croaks, triggering a "down" status, but it's 3am and a single server is handling the traffic just fine. In short, defining "down" in an automated way just exposes internal tooling unnecessarily and generates more false positives than false negatives.

Lastly, if your SLA allows 45 minutes of downtime per year and it takes you an hour to manually update the status page, you just bought yourself an extra hour to figure out how to fix the problem before you have to start issuing refunds/credits.
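The downtime budget mentioned above falls out of simple arithmetic; as a rough sketch (a 45-minute annual budget sits between "four nines" and "five nines"):

```python
# Sketch: yearly downtime budget implied by an SLA percentage.
# The SLA tiers below are illustrative, not any particular vendor's terms.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(sla_percent: float) -> float:
    """Minutes of downtime per year permitted by a given SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)

for sla in (99.9, 99.95, 99.99, 99.999):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.2f} min/year")
```

At 99.99%, for example, the budget is about 52.56 minutes per year, so a single hour-long outage already blows it.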
dogleash | 3 hours ago

> you just bought yourself an extra hour to figure out how to fix the problem before you have to start issuing refunds/credits

No, you didn't. Not unless you're defrauding your customers.
|
|
skywhopper | 9 hours ago

At some level, the status updates have to be manual. Any automation you try to build on top is inevitably going to break in a crisis situation.
pimterry | 8 hours ago

I found GitHub's old "how many visits to this status page have there been recently" graph on their status page to be an absurdly neat solution to this. It required zero insight into other infrastructure and absolutely minimal automation, but it immediately gave you an idea of whether it was down for just you or for everybody. Sadly now deceased.
Kodiack | 8 hours ago

I like that https://discordstatus.com/ shows API response times as well. There are times when Discord seems to have issues, and those usually correlate very well with increased API response times.

Reddit Status used to show API response times way back when I still used the site, but they've really watered it down since then. Everything that goes there has to be entered manually now, AFAIK. Not to mention that one of the few sections is for "ads.reddit.com". Classic.
tom1337 | 5 hours ago

https://steamstat.us still has this. While not official, it's pretty nice.
mlrtime | 8 hours ago

They are manual AND political (depending on how big the company is), because a dashboard going red usually has a bunch of project work behind it.
sjsdaiuasgdia | 8 hours ago

Yeah, this is something people think is super easy to automate, and it is for the most basic implementation: a single test runner. But the most basic implementation is prone to false positives and, as you say, to breaking when the rest of your stuff breaks.

You can put your test runner on different infrastructure, and now you have a whole new class of false positives to deal with. It also costs a bit more, because you're probably paying someone for the separate infra.

You can put several test runners on different infrastructure in different parts of the world, which increases your costs further. The only truly clear signals you get from this are when all runners are passing or all are failing; any mixture of passes and failures is open to misinterpretation. Why is Sydney timing out while all the others pass? Is that an issue with the test runner or its local infra, or is there an internet event (cable cut, BGP hijack, etc.) happening beyond the local infra?

And thus nearly everyone has a human in the loop to interpret the test results and decide whether to post, regardless of how far they've gone with automation.
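The "only unanimous results are clear signals" logic above can be sketched in a few lines. This is a minimal illustration with hypothetical region names, not any real monitoring product's API:

```python
# Sketch: interpreting pass/fail results from geographically distributed
# test runners. Only unanimous outcomes are treated as clear signals;
# anything mixed is escalated to a human, as described above.

def interpret(probe_results: dict[str, bool]) -> str:
    """Map per-region pass/fail results to a status decision."""
    outcomes = set(probe_results.values())
    if outcomes == {True}:
        return "up"        # all runners pass: clear signal
    if outcomes == {False}:
        return "down"      # all runners fail: clear signal
    return "escalate"      # mixed: a human must decide whether to post

print(interpret({"us-east": True, "eu-west": True, "sydney": True}))    # up
print(interpret({"us-east": False, "eu-west": False, "sydney": False})) # down
print(interpret({"us-east": True, "eu-west": True, "sydney": False}))   # escalate
```

The "escalate" branch is exactly why automation alone doesn't eliminate the human: a lone Sydney failure could be the service, the runner, or the network in between.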
|
|
bnjm | 9 hours ago

SLA breaches have consequences; there's no big conspiracy there.
markild | 9 hours ago

I'm not at all saying it's a conspiracy; I just think it's a lack of transparency. I get why, but it would give me more confidence if they told me about everything.
mewpmewp2 | 8 hours ago

I guess a dirty little secret might be that something is always acting up or being noisy, and it would spam the status page completely.
zulban | 8 hours ago

They don't make more money by giving you more confidence in their systems.
|
|