daneel_w 2 days ago

The WAL was corrupted, the actual data is lost. There's no parity. You're suggesting that sqlite should somehow recreate the data from nothing.

AlotOfReading 2 days ago | parent | next [-]

I was assuming sqlite did the sane thing and used a CRC. CRCs have (limited) error correction capabilities that you can use to fix 1-2 bit errors in most circumstances, but apparently sqlite uses a Fletcher variant and gives up that ability (plus better error detection at long message lengths) for negligible performance gains on modern (even embedded) CPUs.
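
For reference, the checksum described in SQLite's WAL file format docs is roughly this Fletcher-style construction (a sketch; the function name is mine, and the docs' byte-order selection based on the WAL header magic is omitted):

    #include <stddef.h>
    #include <stdint.h>

    /* Fletcher-style WAL checksum per the SQLite file format docs:
     * two 32-bit accumulators fed with pairs of 32-bit words. */
    void wal_checksum(const uint32_t *x, size_t nWords,
                      uint32_t *s0, uint32_t *s1) {
        uint32_t a = 0, b = 0;
        for (size_t i = 0; i + 1 < nWords; i += 2) {
            a += x[i] + b;      /* s0 += x(i)   + s1 */
            b += x[i + 1] + a;  /* s1 += x(i+1) + s0 */
        }
        *s0 = a;
        *s1 = b;
    }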

lxgr 2 days ago | parent [-]

The ability to... correct 1-2 bit errors? Is that even a realistic failure mode on common hardware?

The checksums used in SQLite are not intended to detect data corruption due to bit rot, and they are certainly not ECCs.

AlotOfReading 2 days ago | parent [-]

Yes, it's a pretty important error case with storage devices. It's so common that modern storage devices and filesystems include their own protections against it. Your system may or may not have these and bit flips may happen after that point, so WAL redundancy wouldn't be out of place.

Sure, the benefits to the incomplete write use case are limited, but there's basically no reason to ever use a Fletcher checksum these days.

It's also worth mentioning that the VFS checksums are explicitly documented as guarding against storage device bitrot and use the same Fletcher algorithm.

lxgr 2 days ago | parent [-]

It would be absolutely out of place if your lower layers already provided it, and not enough if they didn't (since you'd then also need checksums on the actual database file, which SQLite does not provide – every WAL write eventually happens again there during checkpointing!)

AlotOfReading 2 days ago | parent [-]

sqlite does actually provide database checksums, via the VFS extension I mentioned previously.

There's no harm to having redundant checksums and it's not truly redundant for small messages. It's pretty common for systems not to have lower level checksumming either. Lots of people are still running NTFS/EXT4 on hardware that doesn't do granular checksums or protect data in transit.

Of course this is all a moot point because sqlite does WAL checksums, it just does them with an obsolete algorithm.

lxgr 2 days ago | parent [-]

> There's no harm to having redundant checksums

There sure is: Redundant checksums need extra storage and extra processing. SQLite often runs on embedded systems, where both can come at a premium.

> Of course this is all a moot point because sqlite does WAL checksums, it just does them with an obsolete algorithm.

That's not nearly the only thing missing for SQLite to provide full resistance to lower-level data corruption. At a very minimum, you'd also need checksums of the actual database file.

AlotOfReading 2 days ago | parent [-]

There's no extra storage because it's already storing checksums. There's technically extra processing, but a good CRC is faster than reading memory on virtually all modern systems (even embedded ones), so there's no reason to prefer a Fletcher checksum unless the few extra cycles of latency to get the CRC running are deeply concerning to you.
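
For a sense of scale: CRC32C has had a dedicated x86 instruction since SSE4.2. A minimal sketch (assumes an SSE4.2-capable CPU; build with -msse4.2):

    #include <nmmintrin.h>  /* SSE4.2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* CRC32C, 8 bytes per crc32 instruction. On most modern x86
     * cores this outruns the rate at which memory can be streamed
     * in, so the checksum is nearly free during a read pass. */
    uint32_t crc32c(const uint8_t *p, size_t n) {
        uint64_t crc = 0xFFFFFFFFu;
        while (n >= 8) {
            uint64_t v;
            memcpy(&v, p, 8);
            crc = _mm_crc32_u64(crc, v);
            p += 8;
            n -= 8;
        }
        while (n--)
            crc = _mm_crc32_u8((uint32_t)crc, *p++);
        return (uint32_t)crc ^ 0xFFFFFFFFu;
    }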

> That's not nearly the only thing missing for SQLite to provide full resistance to lower-level data corruption.

A CRC wouldn't provide full resistance anyway, so that's clearly not what I'm arguing. What I'm saying is that a Fletcher checksum is strictly worse than a CRC here. Not an advanced algorithm, a bog-standard CRC. Fletcher actually takes more space in this case, has worse error detection, and doesn't allow recovery. A CRC would let you recover some of the time, from some kinds of errors, in some scenarios. It's simply better than nothing.

lxgr 2 days ago | parent [-]

> There's no extra storage because it's already storing checksums.

At the database level (i.e. not just the WAL)? Are you sure?

> What I'm saying is that a fletcher is strictly worse than a CRC here.

I can't speak to the performance differences, but the only thing SQLite really needs the checksum to do is to expose partial writes, both due to reordered sector writes and partial intra-sector writes. (The former could also be solved by just using an epoch counter, but the latter would require some tricky write formats, and a checksum nicely addresses both).

In both cases, there's really nothing to recover: a CRC can't reconstruct an entire missing sector, and almost never a partially written one (i.e. unless the failure somehow happens in the very last bytes of it, so that only a bit or two actually flipped).
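
(For reference, per the WAL format docs each frame carries a 24-byte header roughly like this sketch; field names are mine. The salts catch stale frames from an earlier WAL generation, and the cumulative checksum catches torn writes:)

    #include <stdint.h>

    /* 24-byte WAL frame header, per the SQLite file format docs: */
    struct wal_frame_hdr {
        uint32_t pgno;     /* page number of this frame's page image */
        uint32_t db_size;  /* db size in pages on commit frames,
                              zero otherwise                         */
        uint32_t salt1;    /* must match the WAL header salts, else  */
        uint32_t salt2;    /*   the frame is stale and is ignored    */
        uint32_t cksum1;   /* cumulative checksum through this       */
        uint32_t cksum2;   /*   frame; a torn write breaks it        */
    };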

AlotOfReading 2 days ago | parent [-]

Just talking about the WAL. The database does have a page-level checksum with the VFS extension, but it's optional.

avinassh 2 days ago | parent | prev [-]

> The WAL was corrupted, the actual data is lost. There's no parity. You're suggesting that sqlite should somehow recreate the data from nothing.

Not all frames in the WAL are important. Sure, recovery may be impossible in some cases, but not all checksum failures are impossible to recover from.

daneel_w 2 days ago | parent | next [-]

My feeling is that any portion that no longer checksums correctly is practically forfeited, which makes any form of recovery an arbitrary result - a half-baked transaction. I don't see how that's a more desirable outcome than rolling back to the last known-good commit, which is never arbitrary.

avinassh 2 days ago | parent [-]

I posted in another comment: there is a small demo in the post. In it, I corrupt an old frame that is not needed by the database at all. Now, one approach would be to continue the recovery and then present both states: one where the WAL is dropped, and another showing whatever we have recovered. If I had such an option, I would almost always pick the latter.

nemothekid 2 days ago | parent | next [-]

> one where the WAL is dropped, and another showing whatever we have recovered. If I had such an option, I would almost always pick the latter

I can't imagine picking the latter unless you were treating sqlite like a filesystem of completely unrelated blobs.

If I run three transactions where:

1. John gives $100 to Sue.

2. Sue gives $100 to Mark.

3. Mark gives $100 to Paul.

If sqlite just erases transaction (2), then Mark materializes $100 from nowhere. The rest of your database is potentially completely corrupted. At that point your database is no longer consistent - I can't see how you would "almost always" prefer this.

If (2) is corrupt, then the restore stops at (1), and you are guaranteed consistency.
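
Concretely, a sketch with the C API (hypothetical accounts table; error handling omitted):

    #include <sqlite3.h>

    /* Three separate committed transactions against a hypothetical
     * accounts(name, balance) table; db is an open handle. */
    void run_transfers(sqlite3 *db) {
        const char *steps[] = {
            "BEGIN;"
            " UPDATE accounts SET balance = balance - 100 WHERE name = 'John';"
            " UPDATE accounts SET balance = balance + 100 WHERE name = 'Sue';"
            " COMMIT;",
            "BEGIN;"
            " UPDATE accounts SET balance = balance - 100 WHERE name = 'Sue';"
            " UPDATE accounts SET balance = balance + 100 WHERE name = 'Mark';"
            " COMMIT;",
            "BEGIN;"
            " UPDATE accounts SET balance = balance - 100 WHERE name = 'Mark';"
            " UPDATE accounts SET balance = balance + 100 WHERE name = 'Paul';"
            " COMMIT;",
        };
        for (int i = 0; i < 3; i++)
            sqlite3_exec(db, steps[i], 0, 0, 0);
        /* Replaying the WAL minus only step 2 leaves John debited and
         * Mark credited with money that never left Sue. */
    }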

avinassh 2 days ago | parent [-]

Yes, this is a good example showing that we cannot always partially apply the WAL. Let me repeat that: we cannot partially apply the WAL all the time and expect it to work, but there are some valid cases where doing so recovers real data. Your example is not one of them.

pests 2 days ago | parent | prev | next [-]

> but not all checksum failures are impossible to recover from

Which failures are possible to recover from?

jandrewrogers 2 days ago | parent | next [-]

There are two examples I know of that require no additional data:

First, force a re-read of the corrupted page from disk. A significant fraction of data corruption occurs while it is being moved between storage and memory due to weak error detection in that part of the system. A clean read the second time would indicate this is what happened.

Second, do a brute-force search for single or double bit flips. This involves systematically flipping every bit (or pair of bits) in the corrupted page, recomputing the checksum, and seeing if it now matches.
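
A sketch of the second approach (bitwise CRC-32 here only to stay self-contained; real code would be table-driven or hardware-accelerated):

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
    static uint32_t crc32(const uint8_t *p, size_t n) {
        uint32_t crc = 0xFFFFFFFFu;
        while (n--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    /* Brute-force single-bit repair: flip each bit in turn and see
     * if the checksum now matches. O(bits) recomputations; double-bit
     * repair is the same idea over pairs, at O(bits^2). */
    static int fix_single_bit(uint8_t *buf, size_t n, uint32_t expected) {
        for (size_t i = 0; i < n * 8; i++) {
            buf[i / 8] ^= (uint8_t)(1u << (i % 8));   /* flip  */
            if (crc32(buf, n) == expected) return 1;  /* fixed */
            buf[i / 8] ^= (uint8_t)(1u << (i % 8));   /* undo  */
        }
        return 0;  /* no single-bit flip explains the mismatch */
    }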

daneel_w 2 days ago | parent [-]

> A significant fraction of data corruption occurs while it is being moved between storage and memory

Surely you mean on the memory bus specifically? SATA and PCIe both have some error correction methods for securing transfers between storage and host controller. I'm not sure about old parallel ATA. While I understand it can happen under conditions similar to non-ECC RAM being corrupted, I don't think I've ever heard or read about a case where a storage device randomly returned erroneous data, short of a legitimate hardware error.

jandrewrogers 2 days ago | parent [-]

The error correction in PCIe, SATA, etc is too weak to be reliable for modern high-performance storage. This is a known deficiency and why re-reading a corrupted page sometimes fixes things. PCIe v6 is introducing a much stronger error detection scheme to address this, which will mostly leave bit-rot on storage media as the major vector.

The bare minimum you want these days is a 64-bit CRC. A strong 128-bit hash would be ideal. Even if you just apply these at the I/O boundaries then you'll catch most corruption. The places it can realistically occur are shrinking but most software makes minimal effort to detect this corruption even though it is a fairly well-bounded problem.
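
The boundary pattern itself is only a few lines; a sketch, using a bitwise CRC-64/XZ as a stand-in for whatever strong checksum you prefer, with a hypothetical page size:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Bitwise CRC-64/XZ (reflected polynomial); small but slow,
     * table-driven or hardware versions are what you'd deploy. */
    static uint64_t crc64(const uint8_t *p, size_t n) {
        uint64_t crc = ~0ull;
        while (n--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xC96C5795D7870F42ull & (0ull - (crc & 1)));
        }
        return ~crc;
    }

    enum { PAGE = 4096 };  /* hypothetical page size */

    /* Seal each page on the way out, verify it on the way back in. */
    void seal_page(uint8_t buf[PAGE + 8]) {
        uint64_t c = crc64(buf, PAGE);
        memcpy(buf + PAGE, &c, 8);
    }

    int verify_page(const uint8_t buf[PAGE + 8]) {
        uint64_t c;
        memcpy(&c, buf + PAGE, 8);
        return crc64(buf, PAGE) == c;  /* 0 => corrupted in transit or at rest */
    }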

daneel_w 2 days ago | parent [-]

Thanks for the tech details.

dec0dedab0de 2 days ago | parent | prev [-]

what if the corruption only affected the stored checksum, but not the data itself?

kstrauser 2 days ago | parent | next [-]

And you would detect this how?

daneel_w 2 days ago | parent | prev [-]

...then the checksum can't clear the associated operation. Same result.
