I always wondered why the fsync has to be lazy. It seems like the fsyncs could be bundled together, with the notification messages held for a few milliseconds while the write completes, similar to TCP corking. There doesn't need to be one fsync per consensus.


kbenson 44 minutes ago | parent | next [-]

That was my immediate thought as well, assuming the lazy fsync is for performance. I imagine that in some situations, holding the confirmation message until the write has actually completed is okay (depending on the delay). But it also occurred to me that if the delay is long enough, the system is busy enough, and the time to send the message is short enough, the number of connections you need to keep open can be some small or large multiple of what you would need without delaying the confirmation until the actual write time.


aphyr 3 hours ago | parent | prev | next [-]

Yes, good call! You can batch up multiple operations into a single call to fsync. You can also tune the number of milliseconds or bytes you're willing to buffer before calling `fsync` to balance latency and throughput. This is how databases like Postgres work by default--see the `commit_delay` option here: https://www.postgresql.org/docs/8.1/runtime-config-wal.html


to11mtm 2 hours ago | parent [-]

> This is how databases like Postgres work by default--see the `commit_delay` option here: https://www.postgresql.org/docs/8.1/runtime-config-wal.html

I must note that the default for Postgres is that there is NO delay, which is a sane default.

> You can batch up multiple operations into a single call to fsync.

I've done this in various messaging implementations for throughput, and it's actually fairly easy to do in most languages:

Basically, set up 1-N writers (how many really depends on how you are storing data), each taking a set of items that pair the data to be written with a TaskCompletionSource (Promise in Java terms). When your code wants to write, it sends an item to that local queue. The worker(s) on the queue write out messages in batches based on whatever thresholds suit you (e.g. write size or number of records, tuned for both throughput and guaranteed forward progress), and when the write completes you either complete or fail the TCS/Promise.

If you've got the right 'glue' in your language/libraries it's not that hard; this example [0] from Akka.NET's SQL persistence layer shows how simple the actual write processor's logic can be... Yes, you have to think about queueing a little, but I've found this basic pattern very adaptable (e.g. the queueing op can just send a bunch of ready-to-go bytes and you work off that for the threshold instead, add framing if needed, etc.).

[0] https://github.com/akkadotnet/Akka.Persistence.Sql/blob/7bab...
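
For anyone who wants to see the shape of the queue-plus-promise pattern described above, here is a minimal Python sketch. It is not the Akka.NET code linked in [0]; the class name `GroupCommitWriter`, the `max_batch` and `max_wait_s` knobs, and the `fsync_count` counter are all hypothetical names made up for illustration, and `concurrent.futures.Future` stands in for the TCS/Promise. It assumes a single append-only file and a single writer thread.

```python
import os
import queue
import threading
import time
from concurrent.futures import Future


class GroupCommitWriter:
    """Hypothetical sketch: append records to one file, one fsync per batch."""

    def __init__(self, path, max_batch=64, max_wait_s=0.002):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._queue = queue.Queue()
        self._max_batch = max_batch      # cap on records per fsync
        self._max_wait_s = max_wait_s    # cap on how long a batch may wait
        self.fsync_count = 0             # illustrative counter only
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, data: bytes) -> Future:
        """Queue a record; the Future resolves only after it is fsynced."""
        fut = Future()
        self._queue.put((data, fut))
        return fut

    def close(self):
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(None)  # sentinel
        self._worker.join()
        os.close(self._fd)

    def _run(self):
        stopping = False
        while not stopping:
            item = self._queue.get()  # block until there is work
            if item is None:
                break
            batch = [item]
            deadline = time.monotonic() + self._max_wait_s
            # Collect more records until the batch is full or the wait expires.
            while len(batch) < self._max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    item = self._queue.get(timeout=remaining)
                except queue.Empty:
                    break
                if item is None:
                    stopping = True
                    break
                batch.append(item)
            self._flush(batch)

    def _flush(self, batch):
        try:
            payload = b"".join(data for data, _ in batch)
            offset = 0
            while offset < len(payload):  # os.write may write partially
                offset += os.write(self._fd, payload[offset:])
            os.fsync(self._fd)            # one fsync covers the whole batch
            self.fsync_count += 1
        except OSError as exc:
            for _, fut in batch:
                fut.set_exception(exc)    # fail every waiter in the batch
        else:
            for _, fut in batch:
                fut.set_result(None)      # acknowledge only after fsync
```

A caller would do something like `fut = writer.submit(b"record\n")` and only send its acknowledgement once `fut.result()` returns. The two thresholds are the latency/throughput trade-off: a larger `max_wait_s` or `max_batch` means fewer fsyncs but slower acknowledgements.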


aphyr 2 hours ago | parent [-]

Ah, pardon me, spoke too quickly! I remembered that it fsynced by default, and offered batching, and forgot that the batch size is 0 by default. My bad!

to11mtm 2 hours ago | parent [-]

Well, the write is still tunable, so you are still correct. I just wanted to clarify that the default is at least safe, in case people perusing this for things to worry about were, well, thinking about worrying. Love all of your work and writings; thank you for all you do!


senderista an hour ago | parent | prev [-]

In practice, there must be a delay (from batching) if you fsync every transaction before acknowledging commit. The database would be unusably slow otherwise.
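
A rough back-of-the-envelope version of that argument, as a sketch: if fsyncs to a single log run one after another, the commit rate is bounded by how many commits share each fsync divided by the fsync latency. The function name and the 5 ms latency below are assumptions picked purely for illustration, not measurements of any real system.

```python
def max_commits_per_second(fsync_latency_s: float, commits_per_fsync: int = 1) -> float:
    """Upper bound on commit rate when fsyncs to one log run back to back."""
    if fsync_latency_s <= 0 or commits_per_fsync < 1:
        raise ValueError("latency must be positive and batch size at least 1")
    return commits_per_fsync / fsync_latency_s


# Assumed, illustrative fsync latency of 5 ms.
ASSUMED_FSYNC_LATENCY_S = 0.005

unbatched = max_commits_per_second(ASSUMED_FSYNC_LATENCY_S)       # one commit per fsync
batched = max_commits_per_second(ASSUMED_FSYNC_LATENCY_S, 50)     # 50 commits share an fsync
print(f"unbatched bound: {unbatched:.0f}/s, batched bound: {batched:.0f}/s")
```

Under that assumed latency, the ceiling scales linearly with the batch size, which is why some batching delay is hard to avoid if every commit is fsynced before it is acknowledged.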