coldfloor 11 hours ago

I was an SRE at Yahoo until around the end of 2024. Not sure if things have changed - last I heard my former team had been laid off - but when I was there it was pretty easy. We had three tiers in the org, with increasing specificity and expertise: Operations Center -> SRE -> Product Engineers.

The OC collectively monitored everything across the company. Each alert that paged had an associated runbook. If they couldn't clear the alert with the runbook, they'd escalate to the SRE responsible for the alerting server/component. Our job was essentially to fix anything that broke that the OC couldn't solve. For my domain this often just came down to basic Linux troubleshooting, but sometimes it would actually involve specific knowledge about our component. For others (e.g. networking) I imagine the ratio of domain-specific-knowledge problems was higher.

If we determined something was fundamentally broken, like someone pushed an update and now the service won't start, we'd escalate that to PE. PE did a lot of what I think falls under SRE purview at other places: managing deployments, building out infrastructure, etc. At Yahoo we were really just "tier 2 ops."

We'd also be paged for outages if our service went down or another team was blaming our service for their outage. The job here was essentially the same thing, just with more pressure and people yelling at you, or else it meant arguing and trying to prove your stuff was working so they'd please find someone else to blame. If we were involved in an outage, we'd also have to join the "post mortem" (I'll never be able to say that without air quotes) and help with RCA/take on remediation tasks.

Secondarily, we created the monitoring/alerts that went to OC and wrote and maintained their runbooks. In our downtime we were also supposed to do simple automation/scripting to help us or OC with repetitive tasks. Sometimes I think I made useful stuff, but often this felt like self-imposed busy work, because we always - especially under Marissa's stack ranking regime - had to demonstrate that we were doing more than just our job. I swear one quarter between us and OC we ended up with like 10 redundant Slack bots because everyone was rushing to make something to pad their review with.