	▲	eliasdejong 2 days ago
		Really excellent research and well written, congrats. It shows that io_uring really brings extra performance when properly used, and not simply as a drop-in replacement.
		> With IOPOLL, completion events are polled directly from the NVMe device queue, either by the application or by the kernel SQPOLL thread (cf. Section 2), replacing interrupt-based signaling. This removes interrupt setup and handling overhead but disables non-polled I/O, such as sockets, within the same ring.
		> Treating io_uring as a drop-in replacement in a traditional I/O-worker design is inadequate. Instead, io_uring requires a ring-per-thread design that overlaps computation and I/O within the same thread.
		1) So does this mean that if you want to take advantage of IOPOLL, you should use two rings per thread: one for network and one for storage?
		2) SQPoll is shown in the graph as outperforming IOPoll. I assume both polling options are mutually exclusive?
		3) I'd be interested in what the considerations are (if any) for using IOPoll over SQPoll.
		4) Additional question: I assume for a modern DBMS you would want to run this as thread-per-core?
	▲	mjasny 2 days ago | parent
		Thanks a lot for the kind words, we really appreciate it! Regarding your questions:
		1) Yes. If you want to take advantage of IOPOLL while still handling network I/O, you typically need two rings per thread: an IOPOLL-enabled ring for storage and a regular ring for sockets and other non-polled I/O.
		2) They are not mutually exclusive. In the experiments, SQPOLL was enabled in addition to IOPOLL (+SQPoll). SQPOLL affects submission, while IOPOLL changes how completions are retrieved.
		3) The main trade-off is CPU usage vs. latency. SQPOLL spawns an additional kernel thread that busy-spins to issue I/O requests from the ring. With IOPOLL, interrupts are not used; instead, the device queues are polled (this does not necessarily result in 100% CPU usage on the core).
		4) Yes. For a modern DBMS, a thread-per-core model is the natural fit. Rings should not be shared between threads; each thread should have its own io_uring instance(s), both to avoid synchronization and for locality.
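The ring layout described in answers 1), 2) and 4) could be sketched roughly as follows. This is a minimal illustration, not the setup code from the paper: the names `ring_kind`, `thread_rings`, `ring_setup_flags`, `ring_create` and `thread_rings_init` are made up for this example, the `sq_thread_idle` value is arbitrary, and enabling SQPOLL only on the storage ring is an assumption. It uses the raw `io_uring_setup(2)` syscall so that it does not depend on liburing, and it leaves out mapping the rings and submitting I/O.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Hypothetical names for this sketch; they are not from the paper or the thread. */
enum ring_kind { RING_STORAGE, RING_NETWORK };

struct thread_rings {
    int storage_fd;  /* IOPOLL ring: polled block I/O only */
    int network_fd;  /* regular ring: sockets and other non-polled I/O */
};

/* IOPOLL and SQPOLL are independent setup bits: IOPOLL changes how
 * completions are retrieved, SQPOLL changes how submissions are issued. */
unsigned ring_setup_flags(enum ring_kind kind, int use_sqpoll)
{
    unsigned flags = 0;
    if (kind == RING_STORAGE)
        flags |= IORING_SETUP_IOPOLL;
    if (use_sqpoll)
        flags |= IORING_SETUP_SQPOLL;
    return flags;
}

/* Create one ring with the raw syscall. Returns the ring fd, or -errno. */
int ring_create(unsigned entries, enum ring_kind kind, int use_sqpoll)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = ring_setup_flags(kind, use_sqpoll);
    if (use_sqpoll)
        p.sq_thread_idle = 2000;  /* ms before the SQ thread sleeps; arbitrary */

    long fd = syscall(__NR_io_uring_setup, entries, &p);
    return fd < 0 ? -errno : (int)fd;
}

/* Meant to be called once by each worker thread, so that no ring is
 * shared between threads. Returns 0 on success or -errno on failure. */
int thread_rings_init(struct thread_rings *r, unsigned entries, int use_sqpoll)
{
    assert(r != NULL);

    r->storage_fd = ring_create(entries, RING_STORAGE, use_sqpoll);
    if (r->storage_fd < 0)
        return r->storage_fd;

    /* Assumption of this sketch: the network ring stays interrupt-driven. */
    r->network_fd = ring_create(entries, RING_NETWORK, 0);
    if (r->network_fd < 0) {
        close(r->storage_fd);
        return r->network_fd;
    }
    return 0;
}
```

Each worker thread would call `thread_rings_init(&rings, 256, 1)` at startup and then use `storage_fd` only for files opened with `O_DIRECT` on a device that supports polling, since IOPOLL requires that. With liburing, the same flags would go into the `io_uring_params` passed to `io_uring_queue_init_params()`, which also maps the rings.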