vlovich123 9 hours ago

Typical Linux alarms are signal-based, which makes them difficult to manage, and rescheduling them can hurt performance since each rearm requires thunking into the kernel. With io_uring and userspace timers things scale much better, but it still requires tricks if you want to support a lot of fast small writes (e.g. above ~1 million writes per second, timer management starts to show up more and more, and I had to work out some crazy tricks to get up to 100M writes per second).
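
For illustration, here is roughly what arming a flush timeout through io_uring looks like instead of using a signal-based alarm (a simplified sketch assuming liburing; no error handling, and the tag value is arbitrary):

    #include <liburing.h>

    /* Arm a one-shot flush timeout as an io_uring SQE, so rearming never
     * goes through signal delivery. The timespec only needs to stay valid
     * until io_uring_submit() returns, which it does here. */
    static void arm_flush_timeout(struct io_uring *ring, long long nsec)
    {
        struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = nsec };
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_timeout(sqe, &ts, 0, 0);       /* relative, one-shot */
        io_uring_sqe_set_data(sqe, (void *)1);       /* tag the "flush timeout" CQE */
        io_uring_submit(ring);
    }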

Veserv 9 hours ago

You do not schedule a timeout on each buffered write. You only schedule one timeout on the transition from empty to non-empty that is retired either when the timeout occurs or when you threshold flush (you may choose to not clear on threshold flush if timeout management is expensive). So, you program at most one timeout per timeout duration/threshold flush.

The point is to guarantee data gets flushed promptly, which only fails when not enough data gets buffered. The timeout is a fallback to bound the flush latency.
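
A rough sketch of what I mean (the threshold constant, names, and the arm/cancel/flush helpers are just placeholders for whatever timer and I/O machinery you actually use):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define FLUSH_THRESHOLD (64 * 1024)   /* illustrative */

    struct coalesce_buf {
        char   data[FLUSH_THRESHOLD];
        size_t len;
        bool   timeout_armed;
    };

    void arm_timeout(void);               /* placeholders: timerfd, io_uring timeout, ... */
    void cancel_timeout(void);
    void flush_to_disk(const void *p, size_t n);

    void buffered_write(struct coalesce_buf *b, const void *p, size_t n)
    {
        /* assumes n <= FLUSH_THRESHOLD */
        if (b->len + n > sizeof b->data) {      /* would overflow: flush early */
            flush_to_disk(b->data, b->len);
            b->len = 0;
        }
        if (b->len == 0 && !b->timeout_armed) {
            arm_timeout();                      /* only on the empty -> non-empty edge */
            b->timeout_armed = true;
        }
        memcpy(b->data + b->len, p, n);
        b->len += n;

        if (b->len >= FLUSH_THRESHOLD) {        /* threshold flush */
            flush_to_disk(b->data, b->len);
            b->len = 0;
            cancel_timeout();                   /* optional if clearing is expensive */
            b->timeout_armed = false;
        }
    }

    void on_timeout(struct coalesce_buf *b)     /* fallback: bound the flush latency */
    {
        if (b->len > 0)
            flush_to_disk(b->data, b->len);
        b->len = 0;
        b->timeout_armed = false;
    }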

vlovich123 9 hours ago

Yes, that can work, but as I said it has trade-offs.

If you flush before the buffer is full, you’re sacrificing throughput. Additionally, the timer actually firing degrades performance, especially if you’re in libc land and only have SIGALRM available.

So when an additional write is added, you want to push out the timer. But arming the timer requires, among other things, reading the current time, and at rates of 10-20 MHz and up reading the wall clock gets expensive; even rdtsc-based approaches start to struggle at 20-40 MHz. You obviously don’t want to do it on every write, but you do want to guarantee the timer never actually fires when you’re producing data at a fast enough clip to fill the buffer within a reasonable time anyway.
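
As a rough illustration of amortizing the clock reads (the constants and names are made up, and this is a simplification rather than the actual trick):

    #include <stdint.h>
    #include <time.h>

    #define CHECK_EVERY 4096          /* writes between clock reads (illustrative) */

    struct timer_guard {
        uint64_t writes_since_check;
        uint64_t deadline_ns;         /* absolute deadline of the armed timeout */
    };

    void rearm_timeout(uint64_t new_deadline_ns);   /* stand-in for the timer API */

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    /* Called on every buffered write, but only touches the clock once per
     * CHECK_EVERY writes; the timeout is only pushed out when the deadline
     * is getting close even though writes are still arriving. */
    void maybe_push_out_timeout(struct timer_guard *g, uint64_t timeout_ns)
    {
        if (++g->writes_since_check < CHECK_EVERY)
            return;
        g->writes_since_check = 0;

        uint64_t t = now_ns();
        if (g->deadline_ns < t + timeout_ns / 4) {   /* under 25% of budget left */
            g->deadline_ns = t + timeout_ns;
            rearm_timeout(g->deadline_ns);
        }
    }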

Source: I implemented write coalescing in my NoSQL database, which can sustain a few billion 8-byte writes/s into an in-memory buffer. Once the buffer is full or a timeout occurs, a flush to disk is triggered, and I net out at around 100M writes/s (sorting the data for the LSM is one of the main bottlenecks). By comparison, DBs like RocksDB can do ~2M writes/s and SQLite ~800k.

Veserv 8 hours ago

You are not meaningfully sacrificing throughput, because the timeout only occurs when you are not writing enough data; you have no throughput to sacrifice. The threshold and timeout should be chosen such that high-throughput cases hit the threshold, not the timeout. The timeout exists to bound the worst-case latency when throughput is low.

You only lose throughput in proportion to the handling cost of a single potentially spurious timeout/timeout clear per timeout duration. You should then tune your buffering and threshold to cap that at an acceptable overhead.
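
(With made-up numbers: if handling a spurious timeout plus the clear costs ~2 µs of work and the timeout is 10 ms, the worst case is one of those per 10 ms, i.e. roughly 0.02% overhead.)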

You should only really have a problem if you want both high throughput and low latency, at which point general solutions are probably not fit for your use case, but you should remain aware of the general principle.

vlovich123 8 hours ago

> You should only really have a problem if you want both high throughput and low latency, at which point general solutions are probably not fit for your use case, but you should remain aware of the general principle.

Yes, you’ve accurately summarized the end goal. Generally people want high throughput AND low latency, not just a cap on the maximum latency.

The one-shot timer approach only solves a livelock risk. I’ll also note that your throughput does actually drop at the same time as the latency spikes, because your buffer stays the same size but you took longer to flush it to disk.

Tuning this correctly turns out to be really difficult in practice, which is why you really want self-healing/self-adapting systems that behave consistently across all hardware and environments.