The Case for a High-Level Kernel-Bypass I/O Abstraction (2019) (irenezhang.net)
61 points by eventhelix 7 months ago | 15 comments
blibble 7 months ago | parent | next [-]

7-10us for what is a hashtable set/get is really, really bad

I can get a packet out to a switch and back to another machine in 1-2us

gtirloni 7 months ago | parent [-]

Do you mean 1-2ms?

joeblubaugh 7 months ago | parent | prev | next [-]

It’s really frustrating that the HotOS paper itself has no details about the benchmarking, and the blog post just says “redis benchmark”. What was the system setup? Persistence options? What was ported to Demikernel? The client writing, the server reading from the NIC? Based on the problem specified in the paper, I assume it’s the reading from the NIC that was implemented in DemiOS.

FridgeSeal 7 months ago | parent | prev | next [-]

This is a super cool idea, and it’s something that sounds fun to play with/try out.

Therefore, I eagerly await the inevitable influx of:

- “you don’t need it”

- “you’re not FAANG enough to justify it”,

- “seems overly complicated, my Python-on-Ubuntu is good enough and who needs more”

style comments telling us why we shouldn’t have fun things like this.

Anyone got any more comments to add to the bingo card?

wmf 7 months ago | parent | next [-]

Preemptive cynicism is even worse than regular cynicism.

dijksterhuis 7 months ago | parent | prev [-]

if you personally want to play with it, go ahead.

i think my personal feeling is that those sorts of comments you listed come out of the woodwork more when the comments section starts turning into an "oh man, this should be the standard for everyone" kind of discussion, which is never the case and is usually the point of those kinds of replies.

at least they are when i reply with those kinds of comments anyway

kd913 7 months ago | parent | prev | next [-]

What is being asked for already exists? It is called Onload.

https://github.com/Xilinx-CNS/onload

a-dub 7 months ago | parent [-]

it is my understanding that io_uring is the generalized open source implementation of this, although i do not think it bypasses the kernel fib trie like openonload does...

secondcoming 7 months ago | parent | prev | next [-]

I looked at using DPDK on some of our GCP instances but it requires setting up a second VPC, which was one hurdle too much.

I’m hoping that io_uring makes all of this unnecessary anyway.

I recall reading a paper where someone noticed that for every packet the Linux kernel receives it has to check if any application has opened a raw socket. Raw sockets are initially needed to allow DHCP to work, so once your machine has been assigned an IP address you can (probably) turn this service off and so give the kernel less work to do. (My memory of the exact details may be sketchy).

Polizeiposaune 7 months ago | parent | next [-]

DHCP issues address leases, not permanent assignments; leases have an expiration time (and earlier suggested renewal/rebind times). So the DHCP client must periodically renew -- if the tenant doesn't renew (perhaps because the DHCP client has been disabled), the DHCP service may lease the address to another tenant.

If the DHCP server hasn't moved to a new address this renewal can be done over unicast using the leased address - however, if the client doesn't receive a response from the server the client state machine will eventually discard the leased address and fall back to broadcast with an all-zeros source address (which is presumably what requires a raw socket).

The DHCP client implementation in question likely keeps the raw socket open for potential future use in this case. A client might be able to close the raw socket and reopen it later (but security folks might also want it to drop the privilege required to reopen the raw socket, and it might be hard to have an ironclad guarantee that the raw socket can be reopened later on a machine that's short on free kernel memory).
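
For anyone who wants the lease timeline above spelled out, here is a minimal sketch of the client states as a function of time since the lease was granted. It assumes the RFC 2131 default timers (renewal T1 at 50% of the lease, rebinding T2 at 87.5%); real servers can override both, and the `Lease` and `client_state` names are made up for illustration, not taken from any DHCP client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    """Hypothetical lease record; fractions are the RFC 2131 defaults."""
    duration_s: float
    t1_fraction: float = 0.5    # renewal time (T1)
    t2_fraction: float = 0.875  # rebinding time (T2)


def client_state(lease: Lease, elapsed_s: float) -> str:
    """Return the DHCP client state after elapsed_s seconds without a reply."""
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    if elapsed_s < lease.duration_s * lease.t1_fraction:
        return "BOUND"      # address in use, nothing to send yet
    if elapsed_s < lease.duration_s * lease.t2_fraction:
        return "RENEWING"   # unicast request to the server that issued the lease
    if elapsed_s < lease.duration_s:
        return "REBINDING"  # broadcast request, still using the leased address
    # Lease expired: the address is discarded and discovery restarts with a
    # broadcast from the all-zeros source address.
    return "INIT"


if __name__ == "__main__":
    lease = Lease(duration_s=3600)
    for t in (0, 1800, 3150, 3600):
        print(t, client_state(lease, t))
```

Only the last state needs the all-zeros source address, which is presumably why a client that wants to survive an unresponsive server keeps the raw socket around.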

Matthias247 7 months ago | parent | prev [-]

io_uring reduces the overhead of system calls - but it doesn't do anything to reduce the overhead of the actual networking stack.

If your send/receive calls spend most of their CPU time going through the routing/fragmentation/filter/BPF/etc. path in the networking stack, then io_uring (or other APIs which just reduce the system call overhead, like sendmmsg/recvmmsg for UDP) might only make a small difference. Source: lots of profiling while implementing QUIC libraries.

An alternative to DPDK that allows bypassing the kernel networking stack would be AF_XDP.
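
The syscall-versus-stack argument above can be put as back-of-the-envelope arithmetic. This is a toy cost model, not a measurement: it assumes batching (io_uring, sendmmsg) amortizes only the per-call overhead across the batch while the per-packet stack cost stays fixed. The function names and every microsecond figure below are hypothetical.

```python
def per_packet_cost_us(stack_us: float, syscall_us: float, batch_size: int) -> float:
    """Per-packet cost when one system call is shared by batch_size packets."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Stack work (routing, filtering, ...) is paid for every packet;
    # only the system call overhead is divided across the batch.
    return stack_us + syscall_us / batch_size


def speedup(stack_us: float, syscall_us: float, batch_size: int) -> float:
    """Speedup of batched sends relative to one system call per packet."""
    unbatched = per_packet_cost_us(stack_us, syscall_us, 1)
    batched = per_packet_cost_us(stack_us, syscall_us, batch_size)
    return unbatched / batched


if __name__ == "__main__":
    # Made-up numbers: a stack-heavy path versus a syscall-heavy path.
    print("stack-heavy:  ", round(speedup(4.0, 1.0, 32), 2))
    print("syscall-heavy:", round(speedup(0.5, 1.0, 32), 2))
```

With these assumed figures the stack-heavy path gains well under 1.5x from batching, while the syscall-heavy path gains close to 3x, which is the shape of the claim: the bigger the stack's share of the time, the less there is for a cheaper syscall interface to win back.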

r00tbeer 7 months ago | parent | prev | next [-]

See https://irenezhang.net/papers/demikernel-sosp21.pdf for a more thorough paper on the Demikernel from 2021. There are some great ideas for improving the kernel interface while still allowing efficient DPDK-style pipelines.

crest 7 months ago | parent | prev | next [-]

For such an interface to be feasible to support in common open source infrastructure, it needs a pure software implementation for testing and development purposes. Even better would be something along the lines of coz, to model performance by throttling down everything else proportionally.

Gollapalli 7 months ago | parent | prev | next [-]

This is great! I think that there are a lot of latency-sensitive applications which really do need to be spared the kernel latency.
