tptacek 4 days ago

Ok I think you're making a well-considered and interesting argument about devicemapper vs. feature-ful filesystems but you're also kind of personalizing this a bit. I want to read more technical stuff on this thread and less about geek cred and yelling. :)

I wouldn't comment but I feel like I'm naturally on your side of the argument and want to see it articulated well.

ajross 4 days ago | parent [-]

I didn't really think it was that bad? But sure, point taken.

My goal was actually the same though: to try to short-circuit the inevitable platform flame by calling it out explicitly and pointing out that the technical details are sort of a solved problem.

ZFS argumentation gets exhausting, and has ever since it was released. It ends up as a proxy for Sun vs. Linux, GNU vs. BSD, Apple vs. Google, hippy free software vs. corporate open source, pick your side. Everyone has an opinion, everyone thinks it's crucially important, and as a result of that hyperbole everyone ends up thinking that ZFS (dtrace gets a lot of the same treatment) is some kind of magically irreplaceable technology.

And... it's really not. Like I said above, if it disappeared from the universe and everyone had to use dm/lvm for the actual problems they need to solve with storage management[1], no one would really care.

[1] Itself an increasingly vanishing problem area! I mean, at scale and at the performance limit, virtually everything lives behind a cloud-adjacent API barrier these days, and the backends there worry much more about driver and hardware complexity than they do about mere "filesystems". Dithering about individual files on individual systems in the professional world is mostly limited to optimizing boot and update time on client OSes. And outside the professional world it's a bunch of us nerds trying to optimize our movie collections on local networks; realistically we could be doing that on something as awful as NTFS if we had to.

nh2 4 days ago | parent [-]

How can I, with dm/lvm:

* For some detected corruption, be told directly which files are affected?

* Get filesystem level snapshots that are guaranteed to be consistent in the way ZFS and CephFS snapshots guarantee?

ajross 4 days ago | parent [-]

On urging from tptacek I'll take that seriously and not as flame:

1. This is misunderstanding how device corruption works. It's not and can't ever be limited to "files". (Among other things, if a directory gets clobbered you can lose whole trees, and then you'd never even be able to enumerate the "corrupted files" at all!) All you know (all you can know) is that you got a success, and that means the relevant data and metadata matched the checksums computed at write time. And that property is no different with dm. But if you want to know a subset of the damage, just read the stderr from tar, or your kernel logs, etc...
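
To make that concrete, a minimal sketch in C (illustrative only, assuming a dm-verity/dm-integrity style setup where a checksum mismatch surfaces to userspace as a read error, typically EIO): walk the tree, read every file, and whatever fails is your damage list, the same information you'd get from tar's stderr.

    /* damage_report.c - walk a tree and read every regular file; on a
     * dm-verity/dm-integrity style setup a checksum mismatch comes back
     * as a read error (EIO), so whatever fails here is your damage list.
     * Build: cc -O2 -o damage_report damage_report.c
     */
    #define _XOPEN_SOURCE 700
    #include <errno.h>
    #include <fcntl.h>
    #include <ftw.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int check_file(const char *path, const struct stat *sb,
                          int type, struct FTW *ftwbuf)
    {
        (void)sb; (void)ftwbuf;
        if (type != FTW_F)                      /* regular files only */
            return 0;

        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            fprintf(stderr, "OPEN FAILED %s: %s\n", path, strerror(errno));
            return 0;                           /* keep walking */
        }

        char buf[1 << 16];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;                                   /* discard data; only errors matter */
        if (n < 0)
            fprintf(stderr, "READ FAILED %s: %s\n", path, strerror(errno));

        close(fd);
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <dir>\n", argv[0]);
            return 2;
        }
        /* FTW_PHYS: don't follow symlinks */
        return nftw(argv[1], check_file, 32, FTW_PHYS) ? 1 : 0;
    }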

2. Metadata robustness in the face of inconsistent updates (e.g. power loss!) is a feature provided by all modern filesystems, and ZFS is no more or less robust than ext4 et al. But all such filesystems (ZFS included) will "lose data" that hadn't been fully flushed. Applications that are sensitive to that sort of thing must (!) handle this by having some level of "transaction" checkpointing (i.e. an fsync call). ZFS does absolutely nothing to fix this for you. What is true is that an unsynchronized snapshot looks like "power loss" at the dm level where it doesn't in ZFS. But... that's not useful for anyone who actually cares about data integrity, because you still have to solve the power loss problem. And solving the power loss problem obviates the need for ZFS.
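
For illustration, a sketch of that application-level "transaction" checkpointing (file names invented): the classic write-to-temp / fsync / rename pattern. If the application does this, a power cut, a dm snapshot, or a ZFS snapshot all leave either the old contents or the new contents on disk, never a torn mix.

    /* atomic_replace.c - the usual application-level "transaction":
     * write to a temp file, fsync it, rename over the target, fsync the
     * directory.  After a power cut (or an unsynchronized snapshot) you
     * see either the old file or the new one, never a torn mix.
     */
    #define _POSIX_C_SOURCE 200809L
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int atomic_replace(const char *dir, const char *path,
                              const void *buf, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        if (write(fd, buf, len) != (ssize_t)len ||  /* 1. write new data   */
            fsync(fd) < 0) {                        /* 2. flush it to disk */
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);

        if (rename(tmp, path) < 0)                  /* 3. atomic swap      */
            return -1;

        /* 4. fsync the directory so the rename itself survives power loss */
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            return -1;
        int ret = fsync(dfd);
        close(dfd);
        return ret;
    }

    int main(void)
    {
        const char state[] = "checkpointed application state\n";
        if (atomic_replace(".", "./state.db", state, sizeof(state) - 1) < 0) {
            perror("atomic_replace");
            return 1;
        }
        return 0;
    }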

koverstreet 4 days ago | parent [-]

1 - you absolutely can and should walk reverse mappings in the filesystem so that from a corrupt block you can tell the user which file was corrupted.

In the future, bcachefs will be rolling out auxiliary dirent indices for a variety of purposes, and one of those will be to give you a list of files that have had errors detected by e.g. scrub (we already generally tell you the affected filename in error messages).
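
As a toy illustration (the structures here are invented, not bcachefs's actual on-disk format): a reverse mapping is essentially an index from block ranges back to the owning inode, so a block number reported by scrub can be turned into a file to name in the error message.

    /* backref_lookup.c - toy illustration of a reverse mapping: given a
     * corrupt block number, find which inode (file) owns it.  Real
     * filesystems keep this as an on-disk b-tree; here it's just a
     * sorted array, and the extents are made up.
     */
    #include <stdint.h>
    #include <stdio.h>

    struct backref {
        uint64_t start;   /* first block of the extent */
        uint64_t len;     /* number of blocks          */
        uint64_t inode;   /* owning file               */
    };

    /* Sorted by start block, non-overlapping. */
    static const struct backref table[] = {
        { 1000,  64, 101 },
        { 2048, 256, 102 },
        { 8192,  32, 103 },
    };

    static const struct backref *lookup(uint64_t block)
    {
        size_t lo = 0, hi = sizeof(table) / sizeof(table[0]);
        while (lo < hi) {                       /* binary search on extents */
            size_t mid = lo + (hi - lo) / 2;
            if (table[mid].start + table[mid].len <= block)
                lo = mid + 1;
            else if (table[mid].start > block)
                hi = mid;
            else
                return &table[mid];
        }
        return NULL;                            /* unallocated or metadata  */
    }

    int main(void)
    {
        uint64_t bad = 2100;                    /* e.g. reported by scrub */
        const struct backref *b = lookup(bad);
        if (b)
            printf("block %llu belongs to inode %llu\n",
                   (unsigned long long)bad, (unsigned long long)b->inode);
        else
            printf("block %llu is not owned by any file\n",
                   (unsigned long long)bad);
        return 0;
    }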

2 - No, metadata robustness absolutely varies across filesystems.

From what I've seen, ext4 and bcachefs are the gold standard here; both can recover from basically arbitrary corruption and have no single points of failure.

Other filesystems do have single points of failure (notably btree roots), and btrfs and I believe ZFS are painfully vulnerable to devices with broken flush handling. You can (and should) blame the device and the shitty manufacturers, but from the perspective of a filesystem developer, we should be able to cope with that without losing the entire filesystem.

XFS is quite a bit better than btrfs, and I believe ZFS, because it has a ton of ways to reconstruct from redundant metadata if it loses a btree root, but it's still possible to lose the entire filesystem if you're very, very unlucky.

On a modern filesystem that uses b-trees, you really need a way of repairing from lost b-tree roots if you want your filesystem to be bulletproof. btrfs has 'dup' mode, but that doesn't mean much on SSDs given that you have no control over whether your replicas get written to the same erase unit.

Reiserfs actually had the right idea - btree node scan, and reconstruct your interior nodes if necessary. But they gave that approach a bad name; for a long time it was a crutch for a buggy b-tree implementation, and they didn't seed a filesystem-specific UUID into the btree node magic number like bcachefs does, so it could famously merge a filesystem from a disk image with the host filesystem.

bcachefs got that part right, and also has per-device bitmaps in the superblock for 'this range of the device has btree nodes' so it's actually practical even if you've got a massive filesystem on spinning rust - and it was introduced long after the b-tree implementation was widely deployed and bulletproof.
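
A rough sketch of the node-scan idea (again with invented structures, not the real bcachefs layout): because each node's magic is derived from the filesystem's UUID, a raw scan of the device only collects nodes belonging to this filesystem, and a stale disk image's nodes can't be merged in. In practice you'd restrict the scan to the ranges marked in the superblock bitmaps rather than reading the whole device.

    /* node_scan.c - sketch of scanning a device for b-tree nodes whose
     * "magic" is derived from the filesystem's UUID, so nodes belonging
     * to some other filesystem image on the same disk are never picked
     * up.  The on-disk layout here is invented for illustration.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NODE_SIZE 4096

    struct node_header {
        uint64_t magic;      /* hash of the fs UUID, not a global constant */
        uint64_t seq;        /* newer wins when reconstructing             */
        /* ... keys follow ... */
    };

    /* Tiny stand-in hash (FNV-1a) to derive a per-filesystem magic. */
    static uint64_t fs_magic(const uint8_t uuid[16])
    {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (int i = 0; i < 16; i++)
            h = (h ^ uuid[i]) * 0x100000001b3ULL;
        return h;
    }

    static void scan(FILE *dev, const uint8_t uuid[16])
    {
        uint8_t buf[NODE_SIZE];
        uint64_t want = fs_magic(uuid), offset = 0;

        /* In reality, only scan the ranges the superblock bitmaps mark
         * as possibly containing btree nodes. */
        while (fread(buf, 1, NODE_SIZE, dev) == NODE_SIZE) {
            struct node_header hdr;
            memcpy(&hdr, buf, sizeof(hdr));
            if (hdr.magic == want)   /* candidate node for *this* filesystem */
                printf("node at byte %llu, seq %llu\n",
                       (unsigned long long)offset,
                       (unsigned long long)hdr.seq);
            offset += NODE_SIZE;
        }
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <device-or-image>\n", argv[0]);
            return 2;
        }
        FILE *dev = fopen(argv[1], "rb");
        if (!dev) { perror("open"); return 1; }

        uint8_t uuid[16] = { 0 };    /* in reality, read from the superblock */
        scan(dev, uuid);
        fclose(dev);
        return 0;
    }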

magicalhippo 4 days ago | parent | next [-]

> XFS is quite a bit better than btrfs, and I believe ZFS, because it has a ton of ways to reconstruct from redundant metadata if it loses a btree root

As I understand it, ZFS also has a lot of redundant metadata (copies=3 on anything important), as well as previous uberblocks[1].

In what way is XFS better? Genuine question, not really familiar with XFS.

[1]: https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSMetadata...

koverstreet 4 days ago | parent [-]

I can't speak with any authority on ZFS; I know its structure the least of all the major filesystems.

I do a ton of reading through forums gathering user input, and lots of people chime in with stories of lost filesystems. I've seen reports of lost filesystems with ZFS, and I want to say I've seen them at around the same frequency as XFS; both are very rare.

My concern with ZFS is that they seem to have taken the same "no traditional fsck" approach as btrfs, favoring entirely online repair. That's obviously where we all want to be, but that's very hard to get right, and it's been my experience that if you prioritize that too much you miss the "disaster recovery" scenarios, and that seems to be what's happened with ZFS; I've read that if your ZFS filesystem is toast you need to send it to a data recovery service.

That's not something I would consider acceptable; fsck ought to be able to do anything a data recovery service would do, and for bcachefs it does.

I know the XFS folks have put a ton of outright paranoia into repair, including full on disaster recovery scenarios. It can't repair in scenarios where bcachefs can - but on the other hand, XFS has tricks that bcachefs doesn't, so I can't call bcachefs unequivocally better; we'd need to wait for more widespread usage and a lot more data.

p_l 3 days ago | parent [-]

The lack of a traditional 'fsck' is because its operation would be the exact same as normal driver operation. The most extreme case involves a very obscure option that lets you explicitly rewind transactions to one you specify, which I've seen used to recover from a broken driver upgrade that led to filesystem corruption in ways that most fscks just barf on, including XFS'.

For low-level meddling and recovery, there's a filesystem debugger that understands all parts of ZFS and can help with, for example, identifying a previous uberblock that is uncorrupted, or recovering specific data, etc.

koverstreet 3 days ago | parent [-]

Rewinding transactions is cool. Bcachefs has that too :)

What happens on ZFS if you lose all your alloc info? Or are there other single points of failure besides the uberblock in the on-disk format?

magicalhippo 3 days ago | parent [-]

> What happens on ZFS if you lose all your alloc info?

According to this[1] old issue, it hasn't happened frequently enough to prioritize implementing a rebuild option; however, one should be able to import the pool read-only and zfs send it to a different pool.

As far as I can tell that's still the status quo. I agree it is something that should be implemented at some point.

That said, certain other spacemap errors might be recoverable[2].

[1]: https://github.com/openzfs/zfs/issues/3210

[2]: https://github.com/openzfs/zfs/issues/13483#issuecomment-120...

koverstreet 3 days ago | parent [-]

I take a harder line on repair than the ZFS devs, then :)

If I see an issue that causes a filesystem to become unavailable _once_, I'll write the repair code.

Experience has taught me that there's a good chance I'll be glad I did, and I like the peace of mind that I get from that.

And it hasn't been that bad to keep up on, thanks to lucky design decisions. Since bcachefs started out as bcache, with no persistent alloc info, we've always had the ability to fully rebuild alloc info, and that's probably the biggest and hardest one to get right.

You can metaphorically light your filesystem on fire with bcachefs, and it'll repair. It'll work with whatever is still there and get you a working filesystem again with the minimum possible data loss.

magicalhippo 3 days ago | parent [-]

As I said, I do think ZFS is great, but there are aspects where it's quite noticeable it was born in an enterprise setting. That sending, recreating and restoring the pool is considered a sufficient disaster recovery plan, not warranting significant development, is one of those aspects.

As I mentioned in the other subthread, I do think your commitment to help your users is very commendable.

koverstreet 3 days ago | parent [-]

Oh, I'm not trying to diss ZFS at all. You and I are in complete agreement, and ZFS makes complete sense in multi-device setups with real redundancy and non-garbage hardware - which is what it was designed for, after all.

Just trying to give honest assessments and comparisons.

ajross 20 hours ago | parent | prev [-]

> 2 - No, metadata robustness absolutely varies across filesystems.

That's misunderstanding the subthread. The upthread point was about metadata atomicity in snapshots, not hardware corruption recovery. A filesystem like ZFS can make sure the journal is checkpointed atomically with the CoW snapshot moment, where dm obviously can't. And I pointed out this wasn't actually helpful because this is a problem that has to be solved above the filesystem, in databases and apps, because it's isomorphic to power loss (something that the filesystem can't prevent).

nh2 18 hours ago | parent [-]

I believe it is helpful because you can stop an app (such as a DB), FS-snapshot, and then e.g. rsync the snapshot or use any other file-based backup tool, and this snapshot is fast and will be correct.

Doing the same with a block device snapshot is not so easy.

ajross 4 hours ago | parent [-]

Again, if your system is "incorrect" having been stopped and snapshotted like that, it is also unsafe vs. power loss, something ZFS cannot save you from. Power loss events are vastly more common than poorly checkpointed database[1] events.

[1] FWIW: every database worth being called a "database" has some level of robust journaling with checkpoints internally. I honestly don't know what software you're talking about specifically except to say that you're likely using it wrong.