Remix.run Logo
solatic 3 hours ago

I think there's two kinds of software-producing-organizations:

There's the small shops where you're running some kind of monolith generally open to the Internet, maybe you have a database hooked up to it. These shops do not need dedicated DevOps/SRE. Throw it into a container platform (e.g. AWS ECS/Fargate, GCP Cloud Run, fly.io, the market is broad enough that it's basically getting commoditized), hook up observability/alerting, maybe pay a consultant to review it and make sure you didn't do anything stupid. Then just pay the bill every month, and don't over-think it.

Then you have large shops: the ones where you're running at the scale where the cost premium of container platforms is higher than the salary of an engineer to move you off it, the ones where you have to figure out how to get the systems from different companies pre-M&A to talk to each other, where you have N development teams organizationally far away from the sales and legal teams signing SLAs yet need to be constrained by said SLAs, where you have some system that was architected to handle X scale and the business has now sold 100X and you have to figure out what band-aids to throw at the failing system while telling the devs they need to re-architect, where you need to build your Alertmanager routing tree configuration dynamically because YAML is garbage and the routing rules change based on whether or not SRE decided to return the pager, plus ensuring that devs have the ability to self-service create new services, plus progressive rollout of new alerts across the organization, etc., so even Alertmanager config needs to be owned by an engineer.

I really can't imagine LLMs replacing SREs in large shops. SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

ffsm8 24 minutes ago | parent [-]

> SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

According to the specified goals of SRE, this is actually not just a small fraction - but something that shouldn't happen. To be clear, I'm fully aware that this will always be necessary - but whenever it happened - it's because the site reliability engineer (SRE) overlooked something.

Hence if that's considered a large part of the job.. then you're just not a SRE as Google defined that role

https://sre.google/sre-book/table-of-contents/