| ▲ | lll-o-lll 3 hours ago | ||||||||||||||||
> It absolutely can be. Perhaps you are unfamiliar with modern cloud block storage, or RAID backed by NVRAM? Both have durability far above and beyond a single physical disk. On AWS, for example, ec2 Block Express offers 99.999% durability. Alternatively, you can, of course, build your own RAID 1 volumes atop ordinary gp3 volumes if you like to design for similar loss probabilities. Certainly you can solve for zero data loss (RPO=0) at the infrastructure level. It involves synchronously replicating that data to a separate physical location. If your threat model includes “fire in the dc”, reliable storage isn’t enough. To survive a site catastrophe with no data loss you must maintain a second, live copy (synchronous replication before ack) in another fault domain. In practice, to my experience, this is done at the application level rather than trying to do so with infrastructure. > There's no conflict between treating audit logs as logs -- which they are -- with having separate delivery streams and treatment for different retention and durability policies It matters to me, because I don’t want to be dependent on a sync ack between two fault domains for 99.999% of my logs. I only care about this when the regulator says I must. > Again, auditors do not care -- a fact you admitted yourself! They care about whether you took reasonable steps to ensure correctness and availability when needed. That is all. I care about matching the solution to the regulation; which varies considerably by country and use-case. However there are multiple cases I have been involved with where the stipulation was “you must prove you cannot lose this data, even in the case of a site-wide catastrophe”. That’s what RPO zero means. It’s DR, i.e., after a disaster. For nearly everything 15 minutes is good, if not great. Not always. | |||||||||||||||||
| ▲ | otterley 3 hours ago | parent [-] | ||||||||||||||||
> It matters to me, because I don’t want to be dependent on a sync ack between two fault domains for 99.999% of my logs. I only care about this when the regulator says I must. If you want synchronous replication across fault domains for a specific subset of logs, that’s your choice. My point is that treating them this way doesn’t make them not logs. They’re still logs. I feel like we’re largely in violent agreement, other than whether you actually need to do this. I suspect you’re overengineering to meet an overly stringent interpretation of a requirement. Which regimes, specifically, dictated that you must have synchronous replication across fault domains, and for which set of data? As an attorney as well as a reliability engineer, I would love to see the details. As far as I know, no one - no one - has ever been held to account by a regulator for losing covered data due to a catastrophe outside their control, as long as they took reasonable measures to maintain compliance. RPO=0, in my experience, has never been a requirement with strict liability regardless of disaster scenario. | |||||||||||||||||
| |||||||||||||||||