▲ nawgz 4 hours ago
> a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system ... to keep [that] system up to date with ever changing threats

> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail

A configuration error can cause an internet-scale outage. What an era we live in.

Edit: also, having finished my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that performing this same migration in staging would have identified this exact error, no?

I'm not big on distributed systems by any means, so maybe I'm being naive, but frankly, a post mortem featuring a faulty Rust snippet that unwraps an error value without ever checking for the error didn't inspire much confidence!
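For what it's worth, here's a rough sketch of the contrast I mean. This is not Cloudflare's actual code; `Feature`, `FEATURE_LIMIT`, `load_features`, and the error type are all invented for illustration. Unwrapping past the limit check turns one oversized file into a process crash, whereas returning the error lets the caller reject the new file and keep serving with the last known-good configuration.

```rust
// Hypothetical sketch only -- names and the limit value are made up.

#[derive(Debug)]
struct Feature {
    name: String,
}

// Assumed hard cap on preallocated feature slots.
const FEATURE_LIMIT: usize = 200;

#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, limit: usize },
}

// Panicking version: the shape of the reported bug. An oversized feature
// file becomes a crash of the whole process.
#[allow(dead_code)]
fn load_features_or_panic(features: Vec<Feature>) -> Vec<Feature> {
    check_limit(&features).unwrap(); // panics on an oversized file
    features
}

// Propagating version: the caller can reject the bad file and keep running
// with the previously loaded configuration.
fn load_features(features: Vec<Feature>) -> Result<Vec<Feature>, ConfigError> {
    check_limit(&features)?;
    Ok(features)
}

fn check_limit(features: &[Feature]) -> Result<(), ConfigError> {
    if features.len() > FEATURE_LIMIT {
        return Err(ConfigError::TooManyFeatures {
            got: features.len(),
            limit: FEATURE_LIMIT,
        });
    }
    Ok(())
}

fn main() {
    // A "doubled" feature file, as described in the write-up.
    let oversized: Vec<Feature> = (0..2 * FEATURE_LIMIT)
        .map(|i| Feature { name: format!("feature_{i}") })
        .collect();

    match load_features(oversized) {
        Ok(f) => println!("loaded {} features", f.len()),
        Err(e) => eprintln!("rejected feature file, keeping previous config: {e:?}"),
    }
}
```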
▲ mewpmewp2 3 hours ago
It would only have been caught in staging if staging had a similar amount of data in the database; if staging has half the data, it would never have occurred there. It's not clear how easy it would be to keep a staging database matching production in the quantity and shape of its data. I think it's quite rare for any company to run staging at the same scale and storage size as prod.
▲ shoo 2 hours ago
The speed and transparency of Cloudflare publishing this post mortem is excellent. That said, I found the "remediation and follow up" section a bit lacking: it doesn't mention how, in general, regressions in query results caused by DB changes could be caught in future before they get widely rolled out.

Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a Bot Management crash, there's still an opportunity to detect that something has gone awry if there were tests checking that queries return functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe-results behaviour.

In theory there are a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to verify that the DB permission change did not regress query results for common queries, for changes that are expected not to cause functional changes in behaviour. Maybe it could have been caught with an automated test suite of the form "spin up a new DB, populate it with a curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc.) to known-good golden outputs", as sketched below.

This style of regression testing is brittle, burdensome to maintain, and error prone when you need to make functional changes and update what the "golden" outputs are - but it gives a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you find out about it in a dev environment or CI before a proposed change goes anywhere near production.
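To make that concrete, a minimal sketch of such a golden-output check might look something like this. The query text, `run_query`, and the canned rows are illustrative stand-ins, not Cloudflare's actual schema, queries, or tooling; a real harness would execute the queries against a disposable test database before and after the proposed permission change.

```rust
// Minimal sketch of a golden-output regression check. `run_query` stands in
// for a real database client call; it returns canned rows here so the
// example compiles and runs on its own.

type Row = Vec<String>;

// Stand-in for executing a query against a disposable test database
// populated with a small curated dataset.
fn run_query(_sql: &str) -> Vec<Row> {
    // Pretend the permission change exposed the same columns via two
    // databases, so the metadata query now returns duplicate rows -- the
    // kind of functional regression we want to catch before rollout.
    vec![
        vec!["http_requests_features".into(), "example_col".into()],
        vec!["http_requests_features".into(), "example_col".into()],
    ]
}

// Normalise away incidental differences such as row order. Duplicates are
// deliberately NOT collapsed: duplicate rows are a real regression here.
fn normalised(rows: &[Row]) -> Vec<Row> {
    let mut sorted = rows.to_vec();
    sorted.sort();
    sorted
}

fn main() {
    // "Golden" output captured from a known-good schema/permission state.
    let golden: Vec<Row> =
        vec![vec!["http_requests_features".into(), "example_col".into()]];

    let actual = run_query(
        "SELECT table, name FROM system.columns WHERE table = 'http_requests_features'",
    );

    if normalised(&actual) == normalised(&golden) {
        println!("query output unchanged; permission change looks safe");
    } else {
        eprintln!("regression: query output differs from golden snapshot");
        eprintln!("golden: {golden:?}");
        eprintln!("actual: {actual:?}");
        std::process::exit(1);
    }
}
```

Run in CI against each proposed DB change, a failure here surfaces the duplicated rows in a dev environment rather than in production.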
▲ norskeld 3 hours ago
This wild `unwrap()` kinda took me aback as well. Someone really believed in themselves writing this. :)
▲ jmclnx 4 hours ago
I have to wonder if AI was involved with the change.