| ▲ | 0xbadcafebee 6 hours ago | ||||||||||||||||
> it sure feels like software has become a brittle mess, with 98% uptime becoming the norm instead of the exception, including for big services

As somebody who has been running systems like these for two decades: the software has not changed. What's changed is that before, nobody trusted anything, so a human had to manually do everything. That slowed down the process, which made flaws happen less frequently. But it was all still crap. Just very slow moving crap, with more manual testing and visual validation. Still plenty of failures, but it doesn't feel like it fails a lot if they're spaced far apart on the status page. The "uptime" is time-driven, not bugs-per-lines-of-code driven.

DevOps' purpose is to teach you that you can move quickly without breaking stuff, but it requires a particular way of working, one that emphasizes building trust. You can't just ship random stuff 100x faster and assume it will work. This is what the "move fast and break stuff" people learned the hard way years ago. And breaking stuff isn't inherently bad, if you learn from your mistakes and make the system better afterward. The problem is, that's extra work that people don't want to do. If you don't have an adult in the room forcing people to improve, you get the disasters of the past month.

An example: Google SREs give teams error budgets; the SREs are acting as the adult in the room, forcing the team to stop shipping and fix their quality issues once the budget is spent.

One way to deal with this in DevOps/Lean/TPS is the Andon cord. It is famously a cord introduced at Toyota that allows any assembly worker to stop the production line until a problem is identified and a fix worked on (not just the immediate defect, but the root cause). This is insane to most business people because nobody wants to stop everything to fix one problem; they want to quickly patch it up and keep working, or ignore it and fix it later.

But as Ford/GM found out, that just leads to a mountain of backlogged problems that makes everything worse. Toyota discovered that if you take the long, painful time to fix it immediately, that has the opposite effect, creating more and more efficiency, better quality, fewer defects, and faster shipping. The difference is cultural. This is real DevOps. If you want your AI work to be both high quality and fast, I recommend following its suggestions.

Keep in mind, none of this is a technical issue; it's a business process issue. | |||||||||||||||||
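To make the error-budget idea above concrete, here is a minimal sketch of the arithmetic, not Google's actual tooling. The class name `ErrorBudget`, its fields, and the "stop shipping when the budget is spent" rule as coded here are illustrative assumptions; real policies are usually more nuanced.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class ErrorBudget:
    """Hypothetical helper: downtime allowance implied by an availability target."""

    slo: float        # target availability as a fraction, e.g. 0.999
    window_days: int  # length of the measurement window in days

    def budget_minutes(self) -> float:
        # Total downtime the target tolerates over the window.
        return (1.0 - self.slo) * self.window_days * MINUTES_PER_DAY

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Budget left after the downtime already observed in the window.
        return self.budget_minutes() - downtime_minutes

    def may_ship(self, downtime_minutes: float) -> bool:
        # Assumed policy: feature releases pause once the budget is spent,
        # the software analogue of pulling the Andon cord.
        return self.remaining_minutes(downtime_minutes) > 0


if __name__ == "__main__":
    for slo in (0.98, 0.999):
        budget = ErrorBudget(slo=slo, window_days=30)
        print(f"{slo:.1%} over 30 days allows {budget.budget_minutes():.1f} minutes down")
```

For scale: a 98% target over 30 days tolerates about 14.4 hours of downtime, while 99.9% tolerates about 43 minutes, which is why the same release habits feel very different under the two targets.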
| ▲ | hackertyper69 5 hours ago | parent | next [-] | ||||||||||||||||
It's a systems engineering job. You need to provide context, define acceptable failure modes, and test at each level for validation. Identify false coupling, poor interfaces, and things that don't match the business context during the agent planning phase. Then communicate / translate to others so their decisions improve the system instead of destroying it by optimizing only for their local situation. | |||||||||||||||||
| ▲ | pixl97 6 hours ago | parent | prev | next [-] | ||||||||||||||||
It also seems like massive consolidation has caused issues too. Everyone is on GitHub. Everyone is on AWS. Everyone is behind Cloudflare. Whenever an issue happens there it affects everyone and everyone sees it. In the past, smaller services did break all the time, but each outage was limited to a much smaller area. Also, systems were typically less integrated with each other, so one service being down rarely took out everything. | |||||||||||||||||
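The integration point can be put in rough numbers. This is a back-of-the-envelope sketch that assumes every provider is a hard dependency and that failures are independent; both are simplifications, and the availability figures are made up for illustration rather than measured values for any real provider. The function names are hypothetical.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def serial_availability(availabilities):
    """Availability of a service that needs ALL of its dependencies up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative figures only: three hard dependencies at 99.9% each.
    combined = serial_availability([0.999, 0.999, 0.999])
    print(f"combined availability: {combined:.4%}")
    print(f"expected downtime: {downtime_hours_per_year(combined):.1f} hours/year")
```

Under these assumptions, each added hard dependency lowers the ceiling on your own availability, and when many sites share the same few providers, those outages also land on everyone at once.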
| ▲ | _doctor_love 4 hours ago | parent | prev | next [-] | ||||||||||||||||
Super good take - the Andon cord is needed everywhere. | |||||||||||||||||
| ▲ | zephen 3 hours ago | parent | prev [-] | ||||||||||||||||
> One way to deal with this in DevOps/Lean/TPS is the Andon cord.

Many years ago, I started working for chip companies. It was like a breath of fresh air. Successful chip companies know the costs (both direct money and opportunity) of a failed tapeout, so the metaphorical equivalent of this cord was there. Find a bug the morning of tapeout? It will be carefully considered and triaged, and may delay tapeout. And, as you point out, the cultural aspect is incredibly important, which means that the messenger won't be shot. | |||||||||||||||||