▲ | hinkley 6 days ago | |
Here's something you all need to learn about site (or for that matter, tool) reliability: Nobody gives a shit about how many good outcomes between incidents there are. They care about how many good hours happen between incidents, and they care how big the incidents are. So if you make a tool that your coworkers use 5 times as much as the old process, that tool better make things at least 6x more stable or people will start talking about how the process fails 'all the time'. "all the time", as near as I've been able to figure out, after people have been yelling at me, my team, or a team I'm privy to, is not "every day". No, all the time just means that it happens every couple of weeks and one time happened twice in one day, twice in consecutive days, or with two customers in rapid succession. Usually the day they're screaming about. So if you're doing that thing every day all day long, where you used to do it rarely, but you made some progress on making it more frequent, nobody cares that it's every 100th run that fails, when it used to be every 10th. They just see the drama has gotten more frequent (and nowhere near as frequent as their narrative says, but you've already lost that argument) |