▲ | stapedium 6 days ago
I’m just a small business & homelab guy, so I’ll probably never use one of these big distributed file systems. But when people start talking petabytes, I always wonder if these things are actually backed up and what you use for backup and recovery?
▲ | ted_dunning 6 days ago | parent | next [-]
It is common for the backup of these systems to be a secondary data center. Remember that there are two purposes for backup: one is hardware failures, the second is fat fingers.

Hardware failures are dealt with by redundancy, which always involves keeping redundant information across multiple failure domains. Those domains can be as small as a cache line or as big as a data center. These failures can be dealt with transparently and automagically in modern file systems.

With fat fingers, the failure domain has no natural boundaries other than time. As such, snapshots kept in the file system are the best choice, especially if you have a copy-on-write system that can keep snapshots with very little overhead.

There is also the special case of adversarial fat fingering, which appears as ransomware. The answer is still snapshots, but the core problem is timely detection, since otherwise you may not have a single clean point in time to recover from.
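A minimal sketch of the kind of time-based retention such snapshots imply (the intervals and bucket names here are illustrative, not any particular file system's policy):

    from datetime import datetime, timedelta

    def snapshots_to_keep(snapshot_times, now):
        """Retain hourly snapshots for a day, daily for a month,
        weekly beyond that; everything else can be pruned."""
        keep, seen_buckets = set(), set()
        for ts in sorted(snapshot_times, reverse=True):
            age = now - ts
            if age <= timedelta(days=1):
                bucket = ("hourly", ts.strftime("%Y%m%d%H"))
            elif age <= timedelta(days=30):
                bucket = ("daily", ts.strftime("%Y%m%d"))
            else:
                bucket = ("weekly", ts.strftime("%Y%W"))
            if bucket not in seen_buckets:  # newest snapshot wins each bucket
                seen_buckets.add(bucket)
                keep.add(ts)
        return keep

    # Example: hourly snapshots over the last three days collapse to
    # roughly one per hour for the last day and one per day before that.
    now = datetime(2024, 1, 10, 12)
    print(len(snapshots_to_keep([now - timedelta(hours=h) for h in range(72)], now)))

The copy-on-write point is what makes keeping this many snapshots cheap: unchanged blocks are shared, so each snapshot only costs the data that changed since the previous one.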
▲ | shermantanktop 6 days ago | parent | prev | next [-]
Backup and recovery is a process with a non-zero failure rate. The more you test it, the lower the rate, but there is always a failure mode. With these systems, the runtime guarantees of data integrity are very high and the failure rate is very low. And best of all, failure is constantly happening as a normal activity in the system, so the recovery path is exercised all the time. So once the data integrity guarantees of your runtime system are better than those of your backup process, why back up? There are still reasons, but they become more specific to the data being stored and less important as a general datastore feature.
▲ | KaiserPro 5 days ago | parent | prev | next [-]
Depends on what the data is. Because of the replication factor here, I assume that this filesystem is optimised for read throughput rather than capacity.

Either way, there is the concept of "nearline" storage: a storage tier designed to be accessed only by a backup agent. The general idea is that it stores a snapshot of the main file system every n hours. After that you keep as many snapshots as you can afford.
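The "every n hours" part is just a loop around whatever snapshot mechanism the main file system exposes. A rough sketch of such a backup agent, with made-up mount points and a plain copy standing in for a real snapshot or replication call:

    import shutil, time
    from datetime import datetime, timezone
    from pathlib import Path

    PRIMARY = Path("/mnt/primary")    # hypothetical mount of the main file system
    NEARLINE = Path("/mnt/nearline")  # hypothetical nearline tier
    INTERVAL_HOURS = 6                # the "every n hours" knob

    def take_nearline_snapshot():
        # Name snapshots by timestamp so a retention policy can prune old ones.
        dest = NEARLINE / datetime.now(timezone.utc).strftime("snap-%Y%m%dT%H%M%SZ")
        # A real agent would use the file system's snapshot API or an incremental
        # copy (e.g. rsync --link-dest); copytree keeps the sketch simple.
        shutil.copytree(PRIMARY, dest)
        return dest

    if __name__ == "__main__":
        while True:
            take_nearline_snapshot()
            time.sleep(INTERVAL_HOURS * 3600)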
▲ | huntaub 6 days ago | parent | prev | next [-]
Well, for active data, the idea is that the replication within the system is enough to keep the data alive from instance failure (assuming that you're doing the proper maintenance and repairing hosts pretty quickly after failure). Backup and recovery, in that case, is used more for saving yourself against fat-fingering an "rm -rf /" type command. Since it's just a file system, you should be able to use any backup and recovery solution that works with regular files.
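Because the file system shows up as an ordinary mount point, even a hand-rolled file-level backup works. A toy sketch (paths are hypothetical) doing the size/mtime comparison that tools like rsync do for you:

    import os, shutil
    from pathlib import Path

    def backup_changed_files(src: Path, dst: Path):
        """Copy files that are new or modified since the last run,
        judged by size and mtime, preserving the directory layout."""
        for root, _dirs, files in os.walk(src):
            for name in files:
                s = Path(root) / name
                d = dst / s.relative_to(src)
                d.parent.mkdir(parents=True, exist_ok=True)
                st = s.stat()
                changed = (not d.exists()
                           or d.stat().st_size != st.st_size
                           or d.stat().st_mtime < st.st_mtime)
                if changed:
                    shutil.copy2(s, d)  # copy2 preserves mtime for the next comparison

    # The distributed file system is just another directory tree here.
    backup_changed_files(Path("/mnt/dfs/projects"), Path("/backup/projects"))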
▲ | dilyevsky 5 days ago | parent | prev [-]
> what you use for backup and recovery

Speaking from experience working at a hyperscaler:

1. Cross-regional mirroring

2. Good old tape backups