Remix.run Logo
pests 2 days ago

> but not all checksum failures are impossible to recover from

Which failures are possible to recover from?

jandrewrogers 2 days ago | parent | next [-]

There are two examples I know of that require no additional data:

First, force a re-read of the corrupted page from disk. A significant fraction of data corruption occurs while it is being moved between storage and memory due to weak error detection in that part of the system. A clean read the second time would indicate this is what happened.

Second, do a brute-force search for single or double bit flips. This involves systematically flipping every bit in the corrupted page, recomputing the checksum, and seeing if corruption is detected.

daneel_w 2 days ago | parent [-]

> A significant fraction of data corruption occurs while it is being moved between storage and memory

Surely you mean on the memory bus specifically? SATA and PCIe both have some error correction methods for securing transfers between storage and host controller. I'm not sure about old parallel ATA. While I understand it can happen under conditions similar to non-ECC RAM being corrupted, I don't think I've ever heard or read about a case where a storage device randomly returned erroneous data, short of a legitimate hardware error.

jandrewrogers 2 days ago | parent [-]

The error correction in PCIe, SATA, etc is too weak to be reliable for modern high-performance storage. This is a known deficiency and why re-reading a corrupted page sometimes fixes things. PCIe v6 is introducing a much stronger error detection scheme to address this, which will mostly leave bit-rot on storage media as the major vector.

The bare minimum you want these days is a 64-bit CRC. A strong 128-bit hash would be ideal. Even if you just apply these at the I/O boundaries then you'll catch most corruption. The places it can realistically occur are shrinking but most software makes minimal effort to detect this corruption even though it is a fairly well-bounded problem.

daneel_w 2 days ago | parent [-]

Thanks for the tech details.

dec0dedab0de 2 days ago | parent | prev [-]

what if the corruption only affected the stored checksum, but not the data itself?

kstrauser 2 days ago | parent | next [-]

And you would detect this how?

2 days ago | parent [-]
[deleted]
daneel_w 2 days ago | parent | prev [-]

...then the checksum can't clear the associated operation. Same result.