Remix.run Logo
ori_b 11 hours ago

Imagine a race condition that writes a file node where a directory node should be. You have a valid object with a valid checksum, but it's hooked into the wrong place in your data structure.

mustache_kimono 11 hours ago | parent [-]

> Imagine a race condition that writes a file node where a directory node should be. You have a valid object with a valid checksum, but it's hooked into the wrong place in your data structure.

A few things: 1) Is this an actual ZFS issue you encountered or is this a hypothetical? 2) And -- you don't imagine this would be discovered during a scrub? Why not? 3) But -- you do imagine it would be discovered and repaired by an fsck instead? Why so? 4) If so, wouldn't this just be a bug, like a fsck, not some fundamental limitation of the system?

FWIW I've never seen anything like this. I have seen Linux plus a flaky ALPM implementation drop reads and writes. I have seen ZFS notice at the very same moment when the power dropped via errors in `zpool status`. I do wonder if ext4's fsck or XFS's fsck does the same when someone who didn't know any better (like me!) sets the power management policy to "min_power" or "med_power_with_dipm".

ori_b 4 hours ago | parent [-]

Here's an example: https://www.illumos.org/issues/17734. But it would not be discovered by a scrub because the hashes are valid. Scrubs check hashes, not structure. It would be discovered by a fsck because the structure is invalid. Fscks check structure, not hashes.

They are two different tools, with two different uses.

mustache_kimono 2 hours ago | parent [-]

> Scrubs check hashes, not structure.

How is the structure not valid here? Can you explain to us how an fsck would discover this bug (show an example where an fsck fixed a similar bug) but ZFS could never? The point I take contention with is that missing an fsck is a problem for ZFS, so more specifically can you answer my 4th Q:

>> 4) If so, wouldn't this just be a bug, like (a bug in) fsck, not some fundamental limitation of the system?

So -- is it possible an fsck might discover an inconsistency ZFS couldn't? Sure. Would this be a fundamental flaw of ZFS, which requires an fsck, instead of merely a bug? I'm less sure.

You do seem to at least understand my general contention with the parent's point. However, the parent is also making a specific claim about a bug which would be extraordinary. Parent's claim is this is a bug which a scrub, which is just a read, wouldn't see, but a subsequent read would reveal.

So -- is it possible an fsck might discover this specific kind of extraordinary bug in ZFS, after a scrub had already read back the data? Of that I'm highly dubious.

ori_b an hour ago | parent [-]

> Can you show us how an fsck would discover this bug but ZFS could never?

I'd have to read closer to be certain, but if my understanding of it is correct, you'd have orphaned objects in the file system. The orphaned objects would be detectable, but would have correct hashes.

Here's a better example of one where scrub clearly doesn't catch it, but fsck would (and zdb does). The metaslab structure could be checked in theory, but isn't. https://neurrone.com/posts/openzfs-silent-metaslab-corruptio...

> if so, wouldn't this just be a bug, like (a bug in) fsck, not some fundamental limitation of the system?

It's not a bug or a fundamental limitation of the system, it's just that fsck != scrub, and nobody has written the code for fsck. If someone wanted to write the code, they could. I suspect it wouldn't even be particularly hard.

But fsck != scrub, and they catch different things.