EvanAnderson 3 hours ago
It reads a lot like the Crowdstrike SNAFU. Machine-generated configuration file b0rks-up the software that consumes it. The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside. The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.
tptacek 3 hours ago
I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).
navigate8310 3 hours ago
I'm amazed that they aren't using a simulator of some sort and are instead pushing changes directly to production.
Aeolun 3 hours ago
I’m fairly certain it will be after they read this thread. It doesn’t feel like they don’t want to improve, or are incapable of it.