| ▲ | nullc 3 days ago |
> - Erasure coding is much more performant than ZFS's

Any plans for much lower rates than typical RAID? Increasingly, modern high-density devices are seeing block-level failures at non-trivial rates, instead of or in addition to whole-device failures. A file might be 100,000 blocks long; adding 1,000 blocks of FEC would expand it by 1% but add tremendous protection against block errors, and it can do so even if you have only a single piece of media. It doesn't protect against device failures, sure, but without good block-level protection, device-level protection is dicey: hitting some block-level error while down to minimal devices seems inevitable, and adding more and more redundant devices is quite costly.
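[A minimal sketch of the kind of in-file redundancy being described, using plain XOR parity rather than the Reed-Solomon-style codes that par/par2 actually use. Blocks are grouped into stripes of k, each stripe gets one parity block, and any single lost block per stripe can be rebuilt. The block size, stripe width, and names are illustrative assumptions, not bcachefs code or any real tool.]

    # Illustrative only: single-parity FEC over a file's blocks.
    # Real tools (par2, modern RS/LDPC codes) can recover many lost blocks
    # per group; XOR parity recovers at most one per stripe.

    BLOCK_SIZE = 4096   # assumed block size
    STRIPE_K = 100      # 100 data blocks + 1 parity block ~= 1% overhead

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def make_parity(blocks):
        """One parity block per stripe of STRIPE_K data blocks."""
        parity = []
        for i in range(0, len(blocks), STRIPE_K):
            p = bytes(BLOCK_SIZE)
            for blk in blocks[i:i + STRIPE_K]:
                p = xor_blocks(p, blk.ljust(BLOCK_SIZE, b"\0"))
            parity.append(p)
        return parity

    def rebuild(stripe, parity):
        """Recover a single missing (None) block within one stripe."""
        missing = [i for i, blk in enumerate(stripe) if blk is None]
        assert len(missing) <= 1, "XOR parity can only rebuild one block"
        if missing:
            p = parity
            for blk in stripe:
                if blk is not None:
                    p = xor_blocks(p, blk)
            stripe[missing[0]] = p
        return stripe

[With STRIPE_K = 100 this matches the ~1% overhead in the comment but only tolerates one bad block per stripe; an MDS code like Reed-Solomon would let 1,000 parity blocks cover any 1,000 bad blocks across the file.]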
|
| ▲ | koverstreet 3 days ago | parent [-] |
It's been talked about. I've seen some interesting work on using just a normal checksum to correct single-bit errors. If there's an optimized implementation we can use in the kernel, I'd love to add it. Even on modern hardware we do see bit corruption in the wild, so it would add real value.
| ▲ | nullc 3 days ago | parent [-]
It's pretty straightforward to use a normal checksum to correct single or even multiple bit errors (depending on the block size, choice of checksum, etc.). Though I expect those bit errors are bus/RAM, and hopefully usually transient: if there is corruption on the media, the whole block is usually going to be lost, because any corruption means its internal block-level FEC has more errors than it can fix. I was thinking more along the lines of adding dozens or hundreds of correction blocks to a whole file, along the lines of par (though there are much faster techniques now).
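[A hedged sketch of the single-bit-correction trick described here, using zlib's CRC-32 in Python purely as a stand-in for whatever checksum a filesystem actually stores. Because CRC is linear over GF(2), the XOR of the stored and recomputed checksums depends only on the error pattern, so a precomputed table maps that syndrome back to a bit position. The 4 KiB block size and all names are assumptions for the example.]

    import zlib

    BLOCK_SIZE = 4096  # assumed block size for the example

    def build_syndrome_table(block_size=BLOCK_SIZE):
        """Map the CRC syndrome of each single-bit error to its bit position.

        CRC-32 is linear, so crc(block ^ e) ^ crc(block) depends only on the
        error pattern e; computing it against an all-zero block is enough.
        """
        zero = bytearray(block_size)
        base = zlib.crc32(bytes(zero))
        table = {}
        # ~33k CRCs of a 4 KiB block; fine for a sketch, a real
        # implementation would derive these syndromes algebraically.
        for bit in range(block_size * 8):
            zero[bit // 8] ^= 1 << (bit % 8)
            table[zlib.crc32(bytes(zero)) ^ base] = bit
            zero[bit // 8] ^= 1 << (bit % 8)  # restore the zero block
        return table

    SYNDROMES = build_syndrome_table()

    def try_correct(block: bytearray, stored_crc: int) -> bool:
        """Flip the single bad bit in place if the syndrome identifies one."""
        syndrome = zlib.crc32(bytes(block)) ^ stored_crc
        if syndrome == 0:
            return True                  # checksum already verifies
        bit = SYNDROMES.get(syndrome)
        if bit is None:
            return False                 # not correctable as a 1-bit flip
        # Caution: a multi-bit error can alias to a single-bit syndrome and
        # get "corrected" into silent corruption; that is the tradeoff
        # raised further down the thread.
        block[bit // 8] ^= 1 << (bit % 8)
        return True

[The same approach can in principle correct a small number of bit flips by tabulating multi-bit syndromes, but the table grows combinatorially with the number of flips covered.]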
| ▲ | koverstreet 3 days ago | parent [-]
You'd think that, wouldn't you? But there are enough moving parts in the IO stack below the filesystem that we do see bit errors. I don't have enough data to do correlations and tell you likely causes, but they do happen. I think SSDs are generally worse than spinning rust (especially enterprise-grade SCSI kit); the hard drive vendors have been at this a lot longer, and SSDs are massively more complicated. From the conversations I've had with SSD vendors, I don't think they've put the same level of effort into making things as bulletproof as possible yet.
| ▲ | nullc 3 days ago | parent [-]
One thing to keep in mind is that correction always comes at some expense of detection. Generally, a code that can always detect N errors can only always correct N/2 of them. So you detect an errored block and correct up to N/2 errors; the block now passes, but if it actually had N errors, your correction will be incorrect and you now have silent corruption. The solution is just to have an excess of error-correction power and then not use all of it, but that can be hard to do if you're trying to shoehorn it into an existing 32-bit CRC. How big are the blocks that the CRC units cover in bcachefs?
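[For reference, the tradeoff cited here is the classical minimum-distance bound from coding theory, stated in general terms and not tied to any particular checksum:]

    % For a code with minimum Hamming distance d:
    \text{errors always detectable} = d - 1,
    \qquad
    \text{errors always correctable} = \left\lfloor \frac{d-1}{2} \right\rfloor .
    % So a code built to always detect N errors (d = N + 1) can only
    % guarantee correcting floor(N/2), and spending that distance budget
    % on correction leaves less margin for detecting worse damage.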
| ▲ | koverstreet 3 days ago | parent [-]
bcachefs checksums (and compresses) at extent granularity, not block; encoded extents (checksummed/compressed) are limited to 128k by default. This is a really good tradeoff in practice: the vast majority of applications are doing buffered IO, not small-block O_DIRECT reads - that really only comes up in benchmarks :) And it gets us better compression ratios and lower metadata overhead. We also have quite a bit of flexibility to add something bigger to the extent for FEC if we need to - we're not limited to a 32/64-bit checksum.
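[A rough back-of-the-envelope for the metadata-overhead point; the 8-byte checksum and 4 KiB block size here are assumptions for illustration, not bcachefs's actual on-disk layout.]

    # Checksum metadata per unit of data, block vs. extent granularity.
    CSUM_BYTES = 8            # assumed checksum size
    BLOCK      = 4 * 1024     # assumed small-block granularity
    EXTENT     = 128 * 1024   # default max encoded extent from the comment

    print(f"per 4 KiB block  : {CSUM_BYTES / BLOCK:.4%} overhead")    # ~0.1953%
    print(f"per 128 KiB extent: {CSUM_BYTES / EXTENT:.4%} overhead")  # ~0.0061%

[The flip side, as noted above, is that verifying or decompressing any part of an encoded extent means reading the whole extent, which mostly matters for small random O_DIRECT reads.]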