mightyham 3 days ago

Unless I am mistaken, it seems like there is a glaring flaw in this scheme, which is that without fsync you cannot guarantee the previous WAL blocks have been persisted before the current one, so a power loss event could leave a hole in the log and cause erroneous recovery. I believe that SSDs reorder writes internally so even having atomic batched O_DIRECT is not a strong enough guarantee for durability. I'll admit that I could be misunderstanding something about the system that alleviates this concern.

hedora 3 days ago | parent | next [-]

Assuming O_DIRECT actually blocks until the SSD has acked (this isn't actually what O_DIRECT's contract says, but what they rely on), you have to wait until each page write acks whenever you need a persistence barrier.

My guess is the preallocation + zeroing is what got them most of the win, and the O_DIRECT is actually hurting, not helping throughput. This has been the case 100% of the time I've benchmarked such things.

If you're doing this sort of stuff for real under Linux, check out sync_file_range. It's the only non-broken and performant sync API for ext4 (note that it's broken by design for many other file systems, and the API is terribly difficult to use correctly).

If you really care, it's probably just easier to use SPDK or something. Linux has historically been pretty hostile towards DBMS implementations.

thomas_fa 3 days ago | parent [-]

That's a lot of valuable information, thanks for the input. Yes, the original blog post mainly focuses on reducing the metadata overhead caused by fsync(). I've gotten a lot of good feedback here, and much of the discussion goes beyond our original scenario settings. We would like to incorporate these enhancement suggestions without re-introducing fsync(), and make the scheme work in more general environments.

jandrewrogers 3 days ago | parent | prev | next [-]

Many storage devices guarantee that all successful DMA (e.g. O_DIRECT) writes are persisted even in the event of a power loss. This does not work on storage devices that do not offer this guarantee obviously. It also does not work if the filesystem does not support direct I/O or requires metadata updates.

This is not a new trick. It has been used in many storage engine designs to effect durability without an fsync.

mightyham 3 days ago | parent [-]

Thanks, that's interesting and I wasn't aware of that. Is there a consistent way to determine if a device offers this guarantee at runtime on Linux?

seebeen 3 days ago | parent | prev | next [-]

I also asked what happens when a power loss happens.

zzsheng 3 days ago | parent | prev | next [-]

Thanks for the feedback. Actually, it was pointed out in the blog that we do not use an append-only log, precisely to avoid the fsync caused by file-size changes. What we use is a preallocated, fixed-size log file, and we write journal data and reclaim space in 4 KB units, also with direct I/O.

convolvatron 3 days ago | parent | prev [-]

if there is a hole in the log, then the end of the log is before the hole. you do have to have checksums on log chunks (better yet, a kind of rolling hash), but then you're really just talking about the number of entries that we would have liked to commit but didn't
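The recovery scan implied here can be sketched as follows: each 4 KiB log block carries a checksum, and replay proceeds in order, stopping at the first block that fails verification, so any valid blocks past a hole are deliberately unreachable. FNV-1a below is an illustrative stand-in; a real engine would likely use CRC32C.

```c
/* Hedged sketch: checksummed fixed-size log blocks, with recovery that
 * stops at the first invalid block. Blocks after a hole are ignored:
 * the log logically ends before the hole. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define PAYLOAD (BLOCK_SIZE - sizeof(uint64_t))

/* 64-bit FNV-1a over the payload (stand-in for CRC32C). */
static uint64_t fnv1a(const unsigned char *p, size_t n)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* A block is PAYLOAD bytes followed by its checksum. */
static void seal_block(unsigned char *block)
{
    uint64_t h = fnv1a(block, PAYLOAD);
    memcpy(block + PAYLOAD, &h, sizeof h);
}

static int block_valid(const unsigned char *block)
{
    uint64_t h;
    memcpy(&h, block + PAYLOAD, sizeof h);
    return h == fnv1a(block, PAYLOAD);
}

/* Returns the number of contiguous valid blocks from the start;
 * everything after the first bad block is discarded. */
static size_t recover(const unsigned char *log, size_t nblocks)
{
    size_t i = 0;
    while (i < nblocks && block_valid(log + i * BLOCK_SIZE))
        i++;
    return i;
}
```

With reordered writes, block 2 may land while block 1 is lost: recovery then truncates the log at block 1, and block 2's committed-but-unreachable entries are exactly the "entries we would have liked to commit but didn't".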

mightyham 3 days ago | parent [-]

Yeah, this is a good point, and maybe a hole wasn't the right way to explain myself. The point is that a WAL is supposed to work such that the main data store always lags behind the WAL, so that if a partial operation (always idempotent) is interrupted by a shutdown, it is replayed and fixed on startup. In the case I describe, because of the lack of fsync, it's possible for the WAL to lag behind the main data store, so partial operations will not be fixed on startup.

convolvatron 3 days ago | parent [-]

that's a much more interesting problem. fundamentally we're in a bad position by having two different formats, one optimized for writing and one for reading, that admit inconsistency between them. Postgres mitigates this slightly by having page level updates to the read indices also be present in the log (physiological), but that's always seemed like a huge waste to me.

if we give ourselves two definitions of persisted - logical (WAL, or write) and physical (index, or read) - it seems like we can maintain the invariant that P < L: (1) keep an in-memory view of the P-to-L delta that we have to consult on every read, and (2) an expensive but asynchronous flush path for updating P, driven by reads verifying that L has landed. then have we patched all the holes(?)
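The invariant above can be sketched in a few lines (a sketch under the thread's assumptions; all names are illustrative): track a logical LSN L for the last record known durable in the WAL and a physical LSN P for the last record applied to the read store, consult the in-memory delta for reads that fall between them, and only advance P after the corresponding log range is verified.

```c
/* Hedged sketch of the P < L invariant: P (physical, read store) must
 * never run ahead of L (logical, WAL). Illustrative names throughout. */
#include <stdint.h>

typedef struct {
    uint64_t logical_lsn;   /* L: last record known durable in the WAL */
    uint64_t physical_lsn;  /* P: last record applied to the read store */
} engine_state;

/* Invariant check: the read store never outruns the log. */
static int invariant_holds(const engine_state *s)
{
    return s->physical_lsn <= s->logical_lsn;
}

/* A read must consult the in-memory delta iff its record is newer than
 * what the read store has absorbed but already durable in the log. */
static int must_consult_delta(const engine_state *s, uint64_t record_lsn)
{
    return record_lsn > s->physical_lsn && record_lsn <= s->logical_lsn;
}

/* Asynchronous flush path: advance P toward L only up to an LSN whose
 * log blocks have been verified as landed. */
static void advance_physical(engine_state *s, uint64_t verified_lsn)
{
    if (verified_lsn <= s->logical_lsn && verified_lsn > s->physical_lsn)
        s->physical_lsn = verified_lsn;
}
```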

edit: of course, one of the root problems here is the drive lying, so how can we know that some log block has actually committed, so that we can update P?