▲ | stapedium 6 days ago
I’m just a small business & homelab guy, so I’ll probably never use one of these big distributed file systems. But when people start talking petabytes, I always wonder if these things are actually backed up and what you use for backup and recovery?
▲ | ted_dunning 6 days ago | parent | next [-]
It is common for the backup of these systems to be a secondary data center. Remember that there are two purposes for backup: one is hardware failures, the second is fat fingers.

Hardware failures are dealt with by redundancy, which always involves keeping redundant information across multiple failure domains. Those domains can be as small as a cache line or as big as a data center. These failures can be dealt with transparently and automagically in modern file systems.

With fat fingers, the failure domain has no natural boundaries other than time. As such, snapshots kept in the file system are the best choice, especially if you have a copy-on-write system that can keep snapshots with very little overhead.

There is also the special case of adversarial fat fingering, which appears as ransomware. The answer is still snapshots, but the core problem is timely detection, since otherwise you may not have a single clean point in time to recover from.
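A minimal sketch of the kind of time-based retention such snapshots imply (the intervals and bucket names here are illustrative, not any particular file system's policy):

    from datetime import datetime, timedelta

    def snapshots_to_keep(snapshot_times, now):
        """Retain hourly snapshots for a day, daily for a month,
        weekly beyond that; everything else can be pruned."""
        keep, seen_buckets = set(), set()
        for ts in sorted(snapshot_times, reverse=True):
            age = now - ts
            if age <= timedelta(days=1):
                bucket = ("hourly", ts.strftime("%Y%m%d%H"))
            elif age <= timedelta(days=30):
                bucket = ("daily", ts.strftime("%Y%m%d"))
            else:
                bucket = ("weekly", ts.strftime("%Y%W"))
            if bucket not in seen_buckets:  # newest snapshot wins each bucket
                seen_buckets.add(bucket)
                keep.add(ts)
        return keep

    # Example: hourly snapshots over the last three days collapse to
    # roughly one per hour for the last day and one per day before that.
    now = datetime(2024, 1, 10, 12)
    print(len(snapshots_to_keep([now - timedelta(hours=h) for h in range(72)], now)))

The copy-on-write point is what makes keeping this many snapshots cheap: unchanged blocks are shared, so each snapshot only costs the data that changed since the previous one.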
▲ | shermantanktop 6 days ago | parent | prev | next [-]
Backup and recovery is a process with a non-zero failure rate. The more you test it, the lower the rate, but there is always a failure mode. With these systems, the runtime guarantees of data integrity are very high and the failure rate is very low. And best of all, failure is constantly happening as a normal activity in the system, so the recovery path is exercised all the time. So once the data integrity guarantees of your runtime system are better than those of your backup process, why back up? There are still reasons, but they become more specific to the data being stored and less important as a general datastore feature.
▲ | KaiserPro 5 days ago | parent | prev | next [-]
Depends on what the data is. Because of the replication factor here, I assume that this filesystem is optimised for read throughput rather than capacity.

Either way, there is the concept of "nearline" storage: a storage tier designed to be accessed only by a backup agent. The general idea is that it stores a snapshot of the main file system every n hours. After that you keep as many snapshots as you can afford.
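The "every n hours" part is just a loop around whatever snapshot mechanism the main file system exposes. A rough sketch of such a backup agent, with made-up mount points and a plain copy standing in for a real snapshot or replication call:

    import shutil, time
    from datetime import datetime, timezone
    from pathlib import Path

    PRIMARY = Path("/mnt/primary")    # hypothetical mount of the main file system
    NEARLINE = Path("/mnt/nearline")  # hypothetical nearline tier
    INTERVAL_HOURS = 6                # the "every n hours" knob

    def take_nearline_snapshot():
        # Name snapshots by timestamp so a retention policy can prune old ones.
        dest = NEARLINE / datetime.now(timezone.utc).strftime("snap-%Y%m%dT%H%M%SZ")
        # A real agent would use the file system's snapshot API or an incremental
        # copy (e.g. rsync --link-dest); copytree keeps the sketch simple.
        shutil.copytree(PRIMARY, dest)
        return dest

    if __name__ == "__main__":
        while True:
            take_nearline_snapshot()
            time.sleep(INTERVAL_HOURS * 3600)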
▲ | huntaub 6 days ago | parent | prev | next [-]
Well, for active data, the idea is that the replication within the system is enough to keep the data alive from instance failure (assuming that you're doing the proper maintenance and repairing hosts pretty quickly after failure). Backup and recovery, in that case, is used more for saving yourself against fat-fingering an "rm -rf /" type command. Since it's just a file system, you should be able to use any backup and recovery solution that works with regular files.
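Because the file system shows up as an ordinary mount point, even a hand-rolled file-level backup works. A toy sketch (paths are hypothetical) doing the size/mtime comparison that tools like rsync do for you:

    import os, shutil
    from pathlib import Path

    def backup_changed_files(src: Path, dst: Path):
        """Copy files that are new or modified since the last run,
        judged by size and mtime, preserving the directory layout."""
        for root, _dirs, files in os.walk(src):
            for name in files:
                s = Path(root) / name
                d = dst / s.relative_to(src)
                d.parent.mkdir(parents=True, exist_ok=True)
                st = s.stat()
                changed = (not d.exists()
                           or d.stat().st_size != st.st_size
                           or d.stat().st_mtime < st.st_mtime)
                if changed:
                    shutil.copy2(s, d)  # copy2 preserves mtime for the next comparison

    # The distributed file system is just another directory tree here.
    backup_changed_files(Path("/mnt/dfs/projects"), Path("/backup/projects"))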
▲ | dilyevsky 5 days ago | parent | prev [-]
> what you use for backup and recovery

Speaking from experience working at a hyperscaler:

1. Cross-regional mirroring

2. Good old tape backups