olivia-banks 3 hours ago
What happens to it up the callstack? Say they propagated it up the stack with `?`. It has to get handled somewhere. If you don't introduce any logic to handle the duplicate database rows, what else are you going to do when the types don't match up besides `unwrap`ping, or maybe emitting a slightly better error message? You could maybe ignore that module's error for that request, but if it were a service more critical than bot mitigation you'd still have the same symptom of getting 500'd.
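For illustration, a rough sketch of what the non-`unwrap` path could look like in Rust: the parser returns a typed error and propagates it with `?`, leaving the policy decision to whoever owns the reload loop (as the replies below describe). `FeatureError`, `parse_features`, and the limit value are invented for the example, not Cloudflare's actual code.

```rust
use std::fmt;

// Invented error type for the example.
#[derive(Debug)]
enum FeatureError {
    TooManyFeatures { got: usize, limit: usize },
    Malformed(String),
}

impl fmt::Display for FeatureError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FeatureError::TooManyFeatures { got, limit } => {
                write!(f, "feature file has {got} features, limit is {limit}")
            }
            FeatureError::Malformed(msg) => write!(f, "malformed feature file: {msg}"),
        }
    }
}

impl std::error::Error for FeatureError {}

// Instead of `unwrap`ping when the file is oversized, surface a typed error.
fn parse_features(raw: &[String], limit: usize) -> Result<Vec<String>, FeatureError> {
    if raw.len() > limit {
        return Err(FeatureError::TooManyFeatures { got: raw.len(), limit });
    }
    Ok(raw.to_vec())
}

// The caller propagates with `?`; what to *do* about the error (fall back,
// alert, degrade) is decided by whoever owns the reload loop, not down here.
fn reload(raw: &[String]) -> Result<Vec<String>, FeatureError> {
    let features = parse_features(raw, 200)?;
    Ok(features)
}
```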
evil-olive 2 hours ago
> What happens to it up the callstack?

as they say in the post, these files get generated every 5 minutes and rolled out across their fleet. so in this case, the thing farther up the callstack is a "watch for updated files and ingest them" component.

that component, when it receives the error, can simply continue using the existing file it loaded 5 minutes earlier.

and then it can increment a Prometheus metric (or similar) representing "count of errors from attempting to load the definition file". that metric should be zero in normal conditions, so it's easy to write an alert rule to notify the appropriate team that the definitions are broken in some way.

that's not a complete solution - in particular it doesn't necessarily solve the problem of needing to scale up the fleet, because freshly-started instances won't have a "previous good" definition file loaded. but it does allow for the existing instances to fail gracefully into a degraded state.

in my experience, on a large enough system, "this could never happen, so if it does it's fine to just crash" is almost always better served by a metric for "count of how many times a thing that could never happen has happened" and a corresponding "that should happen zero times" alert rule.
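A rough sketch of what that "watch and ingest" loop could look like, for illustration only: `Definitions`, `load_definitions`, `publish`, the 30-second retry, and the plain `AtomicU64` standing in for the Prometheus counter are all assumptions, not anything from the post.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Stand-in for a Prometheus counter: "count of errors loading the definition file".
// An alert rule would fire whenever this is nonzero.
static DEFINITION_LOAD_ERRORS: AtomicU64 = AtomicU64::new(0);

struct Definitions; // placeholder for the parsed bot-management definitions

fn publish(_defs: &Definitions) {
    // hand the definitions to the request path (behind an Arc/RwLock in real code)
}

// Placeholder loader; in the real system this is whatever returned the parse error.
fn load_definitions(path: &str) -> Result<Definitions, std::io::Error> {
    let _raw = std::fs::read(path)?;
    // ... parse and validate ...
    Ok(Definitions)
}

fn watch_loop(path: &str) {
    // Weak spot noted above: a freshly started instance has no "previous good"
    // file, so all it can do is retry until the first load succeeds.
    let mut current = loop {
        match load_definitions(path) {
            Ok(defs) => break defs,
            Err(_) => {
                DEFINITION_LOAD_ERRORS.fetch_add(1, Ordering::Relaxed);
                std::thread::sleep(Duration::from_secs(30));
            }
        }
    };
    publish(&current);

    loop {
        std::thread::sleep(Duration::from_secs(300)); // new file roughly every 5 minutes
        match load_definitions(path) {
            Ok(defs) => {
                current = defs;
                publish(&current);
            }
            Err(_) => {
                // Keep serving with the file loaded 5 minutes ago and bump the
                // metric that should be zero in normal conditions.
                DEFINITION_LOAD_ERRORS.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}
```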
acdha 2 hours ago
The way I’ve seen this handled on a few older systems is that they always keep the previous configuration around so the service can switch back. The logic is something like this:

1. At startup, load the last known good config.
2. When signaled, load the new config.
3. When that passes validation, update the last-known-good pointer to the new version.

That way a crash like this one stays recoverable, on the theory that stale config is better than the service staying down. One variant also recorded the last tried config version so it wouldn’t even attempt to parse the latest one until it was changed again.

For Cloudflare, it’d be tempting to have step #3 be after 5 minutes or so to catch stuff which crashes soon but not instantly.
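A minimal sketch of that last-known-good scheme, with made-up names (`ConfigManager`, `parse_and_validate`) and the delayed step #3 modeled as a soak timer; it's one way to express the idea, not any particular system's implementation.

```rust
use std::time::{Duration, Instant};

struct Config; // placeholder for the parsed configuration

// Invented validator for the example.
fn parse_and_validate(_raw: &str) -> Result<Config, String> {
    Ok(Config)
}

struct ConfigManager {
    // Step 1: loaded at startup, only ever replaced by a config that survived.
    last_known_good: Config,
    // A new config that parsed and validated but hasn't "soaked" long enough yet.
    candidate: Option<(Config, Instant)>,
    // Roughly the "after 5 minutes or so" idea: don't promote until it has run a while.
    soak_time: Duration,
    // The variant that records the last tried version, so a known-bad config
    // isn't re-attempted until it changes again.
    last_tried_version: Option<String>,
}

impl ConfigManager {
    // Step 2: when signaled, try the new config.
    fn on_new_config(&mut self, version: String, raw: &str) {
        if self.last_tried_version.as_deref() == Some(version.as_str()) {
            return; // already tried this one; wait until it changes
        }
        self.last_tried_version = Some(version);

        match parse_and_validate(raw) {
            Ok(cfg) => self.candidate = Some((cfg, Instant::now())),
            // Validation failed: keep running on last_known_good.
            Err(_) => self.candidate = None,
        }
    }

    // Step 3, delayed: promote the candidate only after it has run cleanly for soak_time.
    fn tick(&mut self) {
        let ready = matches!(&self.candidate,
            Some((_, since)) if since.elapsed() >= self.soak_time);
        if ready {
            if let Some((cfg, _)) = self.candidate.take() {
                self.last_known_good = cfg;
            }
        }
    }

    // The request path uses the candidate if one is being tried, else the last known good.
    fn active(&self) -> &Config {
        self.candidate
            .as_ref()
            .map(|(cfg, _)| cfg)
            .unwrap_or(&self.last_known_good)
    }
}
```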
__turbobrew__ 2 hours ago
Presumably you kick the error up to a level that says “if parsing the new config fails, keep the old config”.
tptacek 3 hours ago
Yeah, see, that's what I mean.