lll-o-lll 3 hours ago

> It absolutely can be. Perhaps you are unfamiliar with modern cloud block storage, or RAID backed by NVRAM? Both have durability far above and beyond a single physical disk. On AWS, for example, ec2 Block Express offers 99.999% durability. Alternatively, you can, of course, build your own RAID 1 volumes atop ordinary gp3 volumes if you like to design for similar loss probabilities.

Certainly you can solve for zero data loss (RPO=0) at the infrastructure level. It involves synchronously replicating that data to a separate physical location. If your threat model includes “fire in the dc”, reliable storage isn’t enough. To survive a site catastrophe with no data loss you must maintain a second, live copy (synchronous replication before ack) in another fault domain.
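A minimal sketch of what "synchronous replication before ack" means, under stated assumptions: the `Site` class and `durable_write` helper are hypothetical stand-ins, not a real replication API (real systems would use e.g. synchronous database standbys or quorum writes):

```python
# Sketch: RPO=0 means a write must be durable in a SECOND fault domain
# before the client ever sees an acknowledgement. All names here are
# illustrative, not a real replication API.

class Site:
    """One fault domain (e.g. a data center)."""
    def __init__(self, name):
        self.name = name
        self.storage = []

    def durable_write(self, record):
        # In reality: fsync to disk, NVRAM-backed RAID, etc.
        self.storage.append(record)
        return True

def write_with_rpo_zero(record, primary, secondary):
    """Ack only after BOTH sites hold the record durably.

    If the primary site burns down after the ack, the secondary
    still holds every acknowledged record: zero data loss.
    """
    ok_primary = primary.durable_write(record)
    ok_secondary = secondary.durable_write(record)  # synchronous, blocking
    if not (ok_primary and ok_secondary):
        raise IOError("replication failed; do NOT ack the client")
    return "ack"

dc_east = Site("dc-east")
dc_west = Site("dc-west")
ack = write_with_rpo_zero({"event": "x happened"}, dc_east, dc_west)
# After the ack, the record exists in both fault domains.
assert ack == "ack" and dc_west.storage == dc_east.storage
```

The cost of this design is exactly the one debated below: every acknowledged write pays a cross-site round trip.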

In practice, in my experience, this is done at the application level rather than at the infrastructure level.

> There's no conflict between treating audit logs as logs -- which they are -- with having separate delivery streams and treatment for different retention and durability policies

It matters to me, because I don’t want to be dependent on a sync ack between two fault domains for 99.999% of my logs. I only care about this when the regulator says I must.

> Again, auditors do not care -- a fact you admitted yourself! They care about whether you took reasonable steps to ensure correctness and availability when needed. That is all.

I care about matching the solution to the regulation, which varies considerably by country and use case. However, there are multiple cases I have been involved with where the stipulation was “you must prove you cannot lose this data, even in the case of a site-wide catastrophe”. That’s what RPO zero means. It’s DR, i.e., after a disaster. For nearly everything, 15 minutes is good, if not great. Not always.

otterley 3 hours ago | parent [-]

> It matters to me, because I don’t want to be dependent on a sync ack between two fault domains for 99.999% of my logs. I only care about this when the regulator says I must.

If you want synchronous replication across fault domains for a specific subset of logs, that’s your choice. My point is that treating them this way doesn’t make them not logs. They’re still logs.

I feel like we’re largely in violent agreement, other than whether you actually need to do this. I suspect you’re overengineering to meet an overly stringent interpretation of a requirement. Which regimes, specifically, dictated that you must have synchronous replication across fault domains, and for which set of data? As an attorney as well as a reliability engineer, I would love to see the details. As far as I know, no one - no one - has ever been held to account by a regulator for losing covered data due to a catastrophe outside their control, as long as they took reasonable measures to maintain compliance. RPO=0, in my experience, has never been a requirement with strict liability regardless of disaster scenario.

lll-o-lll an hour ago | parent [-]

> I suspect you’re overengineering to meet an overly stringent interpretation of a requirement. Which regimes, specifically, dictated that you must have synchronous replication across fault domains, and for which set of data? As an attorney as well as a reliability engineer, I would love to see the details.

I can’t go into details about current cases with my current employer, unfortunately. Ultimately, the requirements go through legal and are subject to back and forth with representatives of the government(s) in question. As I said, the problem isn’t passing an audit, it’s getting the initial approval to implement the solution by demonstrating how the requirement will be satisfied. Also, cloud companies are in the same boat, and aren’t certified for use as a result.

This is the extreme end of when you need to be able to say “x definitely happened” or “y definitely didn’t happen”. It’s still a “log” from the application’s perspective, but really more of a transactional record that has legal weight. And because you can’t lose it, you can’t send it out the “logging” pipe (which, for performance, is going to sit in a memory buffer for a bit, a local disk buffer for longer, and then get replicated somewhere central); you send it out a transactional pipe and wait for the ack.
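The two pipes can be contrasted in a short sketch. The classes here are hypothetical; a real stack would be something like a buffered log shipper on one side and a synchronous commit to a replicated store on the other:

```python
# Sketch contrasting the two delivery paths described above.
# LoggingPipe and TransactionalPipe are illustrative names only.
import queue

class LoggingPipe:
    """Fast path: buffer in memory, flush later. Records still in
    the buffer are lost if the host (or the whole site) dies."""
    def __init__(self):
        self.buffer = queue.Queue()
    def emit(self, record):
        self.buffer.put(record)      # returns immediately, no ack
    def at_risk(self):
        return self.buffer.qsize()   # unflushed records = loss window

class TransactionalPipe:
    """Slow path: don't return until the record is durably stored
    in another fault domain and the remote side has acked."""
    def __init__(self, remote_store):
        self.remote = remote_store
    def emit(self, record):
        self.remote.append(record)   # stands in for a sync remote commit
        return "ack"                 # only now may the app proceed

remote = []
audit = TransactionalPipe(remote)
logs = LoggingPipe()

logs.emit({"level": "info", "msg": "routine event"})
ack = audit.emit({"event": "y definitely did not happen"})
assert ack == "ack" and len(remote) == 1  # audit record is safe
assert logs.at_risk() == 1                # ordinary log is still only local
```

The point of the split is that only the records with legal weight pay the synchronous-ack cost; everything else keeps the cheap buffered path.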

Having a government tell us “this audit log must survive a dc fire” is a bit unusual, but dealing with the general requirement “we need this data to survive a dc fire” is just another Tuesday. An audit log is nothing special if you are thinking of it as “data”.

You’re a reliability engineer, have you never been asked to ensure data cannot be lost in the event of a catastrophe? Do you agree that this requires synchronous external replication?

otterley an hour ago | parent [-]

> have you never been asked to ensure data cannot be lost in the event of a catastrophe? Do you agree that this requires synchronous external replication?

I have been asked this, yes. But when I tell them what the cost would be to implement synchronous replication in terms of resources, performance, and availability, they usually change their minds and decide not to go that route.