Remix.run Logo
menaerus 5 hours ago

So perhaps a mixed read+write workload would be more interesting, no? Write-only is characteristic of ingestion workloads. That said, libaio vs io_uring difference is interesting. Did you perhaps run a perf profile to understand where the differences are coming from? My gut feeling is that it is not necessarily an artifact of less context-switching with io_uring but something else.

eivanov89 5 hours ago | parent [-]

There are a couple of challenges with mixed read+write workloads on NVMe.

In practice, read latency tends to degrade over time under mixed load. We observe this even across relatively short consecutive runs. To get meaningful results, you need to first drive the device into a steady state. In our case, however, we were primarily interested in software overhead rather than device behavior.

For a cleaner comparison, it would probably make sense to use something like an in-memory block device (e.g., ublk), but we didn’t dig into it.

As for profiling: we didn’t run perf, so the following is my educated guess:

1. With libaio, control structures are copied as part of submission/completion. io_uring avoids some of this overhead via shared rings and pre-registered resources. 2. In our experience (in YDB), AIO syscall latency tends to be less predictable, even when well-tuned. 3. Although we report throughput, the setup is effectively latency-bound (single fio job). With more concurrency, libaio might catch up.

We intentionally used a single job because we typically aim for one thread per disk (two at most if polling enabled). In our setup (usually 6 disks), increasing concurrency per device is not desirable.