compumike 3 hours ago

Re: "page for all 500s": there's a world of difference between "page me with a critical alert at 3am" and "notify me on Monday morning when my normal workday starts". At the extremes:

If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!

If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.

It's not always easy to distinguish those cases clearly or to configure the business-hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking someone up at 3am has a real cost, so you'd better be sure it's worth it.
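To make the two extremes concrete, here is a minimal Python sketch of that kind of routing rule. It is only an illustration under stated assumptions: the names `CheckResult`, `should_page`, and `route`, and the three outcome strings, are hypothetical and not taken from any particular alerting product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CheckResult:
    """One health-check observation (hypothetical shape)."""
    at: datetime
    ok: bool


def should_page(results, n_consecutive, window, now):
    """True if the latest n_consecutive checks inside the window all failed."""
    recent = sorted(
        (r for r in results if timedelta(0) <= now - r.at <= window),
        key=lambda r: r.at,
    )
    if len(recent) < n_consecutive:
        return False
    return all(not r.ok for r in recent[-n_consecutive:])


def route(results, n_consecutive, window, now):
    """Pick an action: wake someone up, defer to working hours, or do nothing."""
    if should_page(results, n_consecutive, window, now):
        return "page"
    if any(not r.ok for r in results):
        # A one-off failure is recorded but does not wake anyone up.
        return "notify_next_business_day"
    return "ignore"


if __name__ == "__main__":
    now = datetime(2024, 1, 6, 3, 0)
    sustained = [CheckResult(now - timedelta(minutes=i), False) for i in range(3)]
    print(route(sustained, n_consecutive=3, window=timedelta(minutes=5), now=now))
```

Here N and M map to `n_consecutive` and `window`. A real setup would also need the business-hours calendar to decide when "next business day" actually is, which this sketch leaves out.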

wasmitnetzen 3 hours ago | parent [-]

Shouldn't GitHub be large enough to not have anyone on-call, but just rotate the responsible team around the world?

alexfoo 40 minutes ago | parent | next [-]

One team can't troubleshoot AND FIX every possible subsystem, so you just end up with lots (growing to hundreds) of people "on-call" anyway.

As others have said, follow-the-sun type models do exist, usually staffed by people in their normal working hours (EMEA, Americas, APAC) but this means you've still got to cover the weekend and public holidays (which there are a lot of when you factor in plenty of different countries).

Where you need a quick response you can have a core ops/noc team that looks at things with lower thresholds and shorter windows, and their job is to do the initial triage and then page the appropriate team earlier than they would have been alerted by their own alert thresholds/monitoring.
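As a rough sketch of that tiering, assuming alerts are driven by an error rate measured over a sliding window: the NOC rule uses a lower threshold and a shorter window than the owning team's rule, so it fires first. The `AlertRule` type, the audience names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    audience: str                # who gets alerted (hypothetical labels)
    error_rate_threshold: float  # fire when the observed rate exceeds this
    window_minutes: int          # window the rate is measured over


def audiences_to_alert(rules, error_rate_by_window):
    """Return audiences whose rule fires, given observed rates keyed by window length."""
    fired = []
    for rule in rules:
        rate = error_rate_by_window.get(rule.window_minutes)
        if rate is not None and rate > rule.error_rate_threshold:
            fired.append(rule.audience)
    return fired


# Illustrative numbers only: the NOC looks at 5 minutes with a 1% threshold,
# the owning team at 30 minutes with a 5% threshold.
RULES = [
    AlertRule("noc", error_rate_threshold=0.01, window_minutes=5),
    AlertRule("owning-team", error_rate_threshold=0.05, window_minutes=30),
]

if __name__ == "__main__":
    # Early in an incident only the short window has crossed its threshold.
    print(audiences_to_alert(RULES, {5: 0.02, 30: 0.015}))
```

In this arrangement the NOC sees the problem while the owning team's own rule is still quiet, and triage decides whether to page that team by hand.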

Actually clicking the button to change the status on a public status page is a whole different topic that becomes very political in certain companies.

bobthepanda 3 hours ago | parent | prev [-]

At least when I worked at a Bigcorp a lot of that was being cut to save costs.

lokar an hour ago | parent [-]

I've worked in large orgs where we could (and at some times did) have around-the-world rotations. They don't work well. It's very hard to maintain real team cohesion, and you end up with really superficial operations. People tend not to dig in really deep, find good fixes, etc. Lots of superficial bandages.