The Case for a High-Level Kernel-Bypass I/O Abstraction (2019)

blibble a year ago | parent | next [-]

7-10us for what is a hashtable set/get is really, really bad

I can get a packet out to a switch and back to another machine in 1-2us

▲

gtirloni a year ago | parent [-]

Do you mean 1-2ms?

▲

eqvinox a year ago | parent [-]

No, 1-2us is correct for that — in a datacenter, with cut-through switching.

▲

gtirloni a year ago | parent | next [-]

That's really impressive. I need to update myself on this topic. Thanks.

▲

mickg10 a year ago | parent [-]

In reality - with decent switches at 25G and no FEC - node to node is reliably under 300ns (0.3 us)

▲

znyboy a year ago | parent | next [-]

Considering that 300 light-nanoseconds is about 90m, getting a response (or even just one-way) in that time is essentially running right at the limits of physics/causality.
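For anyone who wants to check the "300 light-nanoseconds is about 90m" figure, here is a rough sketch of the arithmetic. The vacuum speed of light is exact; the 0.67 velocity factor for cable is an assumed typical value, not something stated in the thread.

```python
# Speed of light in vacuum, exact by SI definition.
C_VACUUM_M_PER_S = 299_792_458


def distance_m(nanoseconds: float, velocity_factor: float = 1.0) -> float:
    """Distance a signal covers in the given time at velocity_factor * c."""
    return C_VACUUM_M_PER_S * velocity_factor * nanoseconds * 1e-9


# 300 ns of one-way travel in vacuum.
one_way_vacuum = distance_m(300)

# If the 300 ns were a full round trip, the nodes could be at most half that far apart.
max_separation_if_rtt = one_way_vacuum / 2

# ASSUMPTION: signals in fibre or copper travel at roughly 0.67c.
one_way_cable = distance_m(300, velocity_factor=0.67)

print(f"vacuum, one way:      {one_way_vacuum:.1f} m")
print(f"vacuum, as RTT:       {max_separation_if_rtt:.1f} m")
print(f"cable (0.67c assumed): {one_way_cable:.1f} m")
```

This only bounds propagation; it says nothing about switch or NIC processing time, which has to fit inside the same 300 ns.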


▲

davekeck a year ago | parent | prev | next [-]

Out of curiosity, how is that measured across machines?

(The first thing that comes to my mind would be to use an oscilloscope with two probes, one to each machine, but I’m guessing that’s not it.)

▲

toast0 a year ago | parent [-]

Measure the round trip and divide by two for the approximate one way time. It'd be really neat to measure the time it takes for a packet to travel in one direction, but it's somewhere between hard and impossible[1]; a very short path has less room to be asymmetric though.

[1] If the clocks are synchronized, you can measure send time on one end, and receive time on the other. But synchronizing clocks involves estimating the time it takes for signals to pass in each direction, typically assuming each direction takes half the round trip.
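The circularity in the footnote can be sketched with the usual four-timestamp exchange (the same arithmetic NTP uses). The `simulate` helper and its numbers are hypothetical, made up purely to show that any path asymmetry lands in the offset estimate as an error of half the difference.

```python
def estimate(t0: float, t1: float, t2: float, t3: float) -> tuple[float, float]:
    """Four-timestamp estimate of round-trip time and clock offset.

    t0: request sent      (client clock)
    t1: request received  (server clock)
    t2: reply sent        (server clock)
    t3: reply received    (client clock)
    """
    rtt = (t3 - t0) - (t2 - t1)
    # This formula silently assumes forward and reverse delays are equal.
    offset = ((t1 - t0) + (t2 - t3)) / 2
    return rtt, offset


def simulate(true_offset: float, fwd: float, rev: float) -> tuple[float, float]:
    """Hypothetical exchange: server clock = client clock + true_offset."""
    t0 = 0.0
    t1 = t0 + fwd + true_offset   # arrival, read on the server clock
    t2 = t1                       # server replies immediately
    t3 = t0 + fwd + rev           # arrival, read on the client clock
    return estimate(t0, t1, t2, t3)


# Units are microseconds. True offset is 5 us in both cases.
symmetric = simulate(true_offset=5.0, fwd=1.0, rev=1.0)
asymmetric = simulate(true_offset=5.0, fwd=1.5, rev=0.5)

print("symmetric  (rtt, offset):", symmetric)
print("asymmetric (rtt, offset):", asymmetric)
```

Both runs see the same round trip, so the client has no way to tell them apart from timestamps alone; the asymmetric run is off by (1.5 - 0.5) / 2 = 0.5 us.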

▲

pkhuong a year ago | parent [-]

You can use something like White Rabbit (https://en.wikipedia.org/wiki/White_Rabbit_Project) to keep clocks in sync. That still involves estimates, but a dedicated time sync network can do things like make sure all the cables are the same length.

	▲	namibj a year ago \| parent [-]
		Copper White Rabbit is special, it uses the same wire in both directions (1000BASE-T PHY with added carrier phase lock to and from outside clocks).

▲

jiggawatts a year ago | parent | prev [-]

Meanwhile the best network I’ve ever benchmarked was AWS and measured about 55µs for a round trip!

What on earth are you using that gets you down to single digits!?

▲

Galanwe a year ago | parent | next [-]

> the best network I’ve ever benchmarked was AWS and measured about 55µs for a round trip

What is "a network" here?

Few infrastructures are optimised for latency, most are geared toward providing high throughput instead.

In fact, apart from HFT, I don't think most businesses are all that latency sensitive. Most infrastructure providers will give you SLAs of high single or low double digits microseconds from Mahwah/Carteret to NY4, but these are private/dedicated links. There's little point to optimising latency when your network ends up on the internet where the smallest hops are milliseconds away.

▲

jiggawatts a year ago | parent [-]

> There's little point to optimising latency when your network ends up on internet where the smallest hops are milliseconds away.

That's just plain wrong. Lower latency always improves everything. Not just responsiveness, but also bandwidth! Because of TCP slow-start and congestion control algorithms, lower latency directly results in higher throughput.
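A back-of-the-envelope sketch of that latency/throughput link: a sender with a fixed window can have at most one window in flight per round trip, and an idealised slow start needs one round trip per doubling. The 64 KiB window, 1460-byte MSS and 10-segment initial window are assumed illustrative values, and loss is ignored.

```python
def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes * 8 / rtt_s


def slow_start_rounds(target_window_bytes: int, mss: int = 1460,
                      initial_segments: int = 10) -> int:
    """Round trips for an idealised, loss-free slow start to reach the target window."""
    window = mss * initial_segments
    rounds = 0
    while window < target_window_bytes:
        window *= 2   # the congestion window doubles once per round trip
        rounds += 1
    return rounds


WINDOW = 64 * 1024  # ASSUMPTION: 64 KiB window, no window scaling

for label, rtt_s in (("50 us", 50e-6), ("1 ms", 1e-3), ("50 ms", 50e-3)):
    gbps = window_limited_bps(WINDOW, rtt_s) / 1e9
    print(f"RTT {label:>5}: at most {gbps:.3f} Gbit/s")

rounds_to_1mib = slow_start_rounds(1024 * 1024)
print(f"slow start to a 1 MiB window: {rounds_to_1mib} round trips")
```

With these assumptions the same window that caps out around 10 Gbit/s at 50 us manages only about 10 Mbit/s at 50 ms, and the ramp-up takes the same number of round trips either way, so its wall-clock cost scales directly with RTT.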

Not to mention that these latencies add up, which is especially important with chatty microservices applications. Don't forget that typical TCP+HTTPS connections require something like 5 round trips, and that's assuming that the DNS record is already cached! Add in firewalls, load balancers, proxies, side-cars, ingress, and who knows what else, suddenly you're staring down the barrel of 15 millisecond latencies before the data can exit the data centre.

The threshold for "instant" response is 16.7 ms end-to-end, including refreshing the HTML DOM and painting pixels to the screen.
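To put numbers on "these latencies add up": the sketch below takes the 5 round trips quoted above and compares their cost against a 16.7 ms budget, reading that figure as one frame at 60 Hz (an interpretation, since the comment does not say where it comes from). The RTT values are illustrative picks.

```python
# ASSUMPTION: 16.7 ms is read here as one display frame at 60 Hz.
FRAME_BUDGET_MS = 1000 / 60


def setup_cost_ms(rtt_ms: float, round_trips: int = 5) -> float:
    """Time spent purely waiting on connection-setup round trips."""
    return rtt_ms * round_trips


def budget_share(rtt_ms: float, round_trips: int = 5) -> float:
    """Fraction of the frame budget consumed by connection setup."""
    return setup_cost_ms(rtt_ms, round_trips) / FRAME_BUDGET_MS


for rtt_ms in (0.055, 1.0, 5.0):
    print(f"RTT {rtt_ms:>5} ms -> setup {setup_cost_ms(rtt_ms):6.3f} ms "
          f"({budget_share(rtt_ms):.0%} of budget)")
```

At a 55 us RTT the handshakes are noise; at 5 ms they already overrun the whole budget before any application work happens.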

Google and AWS know this, which is why their data centre networks have ~50µs latencies, some of the best in the industry.

Everyone else: "Nah, don't bother!"

	▲	Galanwe a year ago \| parent [-]
		I think you're getting pissed off at a strawman. Everyone obviously _cares_ about latency. All things equal, better latency always makes things better, there is no arguing with that. Yet, that doesn't mean latency is at the same priority spot on everyone's list. If you're using TCP on the internet, you have already put latency far down in your concerns. That doesn't make you _not want_ better latency, but that does make it a _nice to have_. There's no obvious shortcut to latency that doesn't involve either losing reliability (not requiring ordered messages, not re-requesting dropped messages), or losing throughput (not assembling small messages into bigger ones), or limiting yourself to private links. If you do all the above (as in TCP over the internet), then you've made no sacrifice for latency over throughput and resiliency, which to me makes latency a nice to have, but certainly not a primary concern.

▲

dahfizz a year ago | parent | prev | next [-]

The key is that blibble is talking about switches. Modern switches can process packets at line rate.

If you're working in AWS, you almost certainly are hitting a router, which is comparatively slower. Not to mention you are dealing with virtualized hardware, and you are probably sharing all the switches & routers along your path (if someone else's packet is ahead of yours in the queue, you have to wait).

▲

crest a year ago | parent | prev | next [-]

I assume 1-3 hops of modern switches without congestion. Given 100Gb/s lanes these numbers are possible if you get all the bottlenecks out of the way. The moment you hit a deep queue the latency explodes.
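A rough sketch of why a deep queue blows up latency: a packet has to wait for everything ahead of it to drain at line rate. The queue depths are hypothetical; only the 100Gb/s rate comes from the comment above.

```python
def drain_time_us(queued_bytes: int, link_gbps: float) -> float:
    """Time for the bytes already queued ahead of a packet to leave the port."""
    bits = queued_bytes * 8
    return bits / (link_gbps * 1e3)   # Gbit/s == 1e3 bits per microsecond


LINK_GBPS = 100

# HYPOTHETICAL queue depths, chosen only for illustration.
one_packet_ahead = drain_time_us(1_500, LINK_GBPS)
deep_queue = drain_time_us(1_000_000, LINK_GBPS)

print(f"one 1500-byte packet ahead: {one_packet_ahead:.2f} us")
print(f"1 MB already queued:        {deep_queue:.1f} us")
```

So an uncongested hop costs a fraction of a microsecond, while a single megabyte of backlog costs more than an entire tuned round trip.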

▲

jiggawatts a year ago | parent [-]

So, are you talking about theoretical latencies here based on bandwidths and cable lengths, or actual measured latencies end-to-end between hosts?

I know that "in principle" the physics of the cabling allows single digit microseconds, but I've never seen it anywhere near that low even with cross-over cables with zero switches in-path!

▲

eqvinox a year ago | parent [-]

You need high-bandwidth links (time to get the entire packet across starts to matter), you need to run on bare metal (or have very well working HW virtualisation support), and you need to tune NIC parameters and OS processing appropriately. But it's practically achievable.

Switches in these scenarios (e.g. 25GE DC targeted) are pretty predictable and add <1μs (unless misconfigured)
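To illustrate why link speed and cut-through switching matter here, a simplified model: a store-and-forward switch must receive the whole frame before sending it on, so the frame is serialised once per link, while a cut-through switch starts forwarding after reading the header. The 0.5 us per-switch figure is an assumed value under the "<1μs" quoted above, and propagation and host processing are ignored.

```python
def serialization_us(frame_bytes: int, link_gbps: float) -> float:
    """Time to clock one frame onto the wire."""
    return frame_bytes * 8 / (link_gbps * 1e3)   # Gbit/s == 1e3 bits per microsecond


def path_latency_us(frame_bytes: int, link_gbps: float, switches: int,
                    switch_us: float, cut_through: bool) -> float:
    """One-way latency, ignoring propagation delay and host processing."""
    # Store-and-forward pays the serialisation delay on every link
    # (switches + 1 links); cut-through pays it roughly once.
    serializations = 1 if cut_through else switches + 1
    return serializations * serialization_us(frame_bytes, link_gbps) + switches * switch_us


SWITCH_US = 0.5  # ASSUMPTION: per-switch forwarding latency

for gbps in (1, 10, 25, 100):
    ct = path_latency_us(1500, gbps, 1, SWITCH_US, cut_through=True)
    sf = path_latency_us(1500, gbps, 1, SWITCH_US, cut_through=False)
    print(f"{gbps:>3}G: cut-through {ct:6.2f} us, store-and-forward {sf:6.2f} us")
```

Under this model a full-size frame at 1G spends 12 us just getting onto the wire, which already rules out single-digit microseconds; at 25G the same frame takes under half a microsecond.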

	▲	jiggawatts a year ago \| parent [-]
		> But it's practically achievable.
		I've never seen this in practice. Maaaaybe with Infiniband and custom-written apps that use a proprietary SDK. I'd love to see references to actual benchmarks.

▲

blibble a year ago | parent | prev [-]

that's because cloud networks are complete shit

this is Xilinx/Mellanox cards with kernel bypass and cut-through switches with busy-waiting

in reality, in a prod system

	▲	jiggawatts a year ago \| parent [-]
		Both Azure and AWS have kernel-bypass, and they use 100 to 200 Gbps NICs that are either bespoke silicon or have onboard FPGAs for offloading various things such as encryption and packet header rewrites. I wouldn't rate them as "complete shit".

▲

joeblubaugh a year ago | parent | prev | next [-]

It’s really frustrating that the HotOS paper itself has no details about the benchmarking, and the blog post just says “redis benchmark”. What was the system setup? Persistence options? What was ported to demikernel? The client writing, the server reading from the NIC? Based on the problem specified in the paper, I assume it's reading from the NIC that was implemented in DemiOS.

▲

FridgeSeal a year ago | parent | prev | next [-]

This is a super cool idea, and it’s something that sounds fun to play with/try out.

Therefore, I eagerly await the inevitable influx of:

- “you don’t need it”

- “you’re not FAANG enough to justify it”,

- “seems overly complicated, my Python-on-Ubuntu-is-good-enough and who needs more”

Style comments telling us why we shouldn’t have fun things like this.

Anyone got any more comments to add to the bingo-card?

	▲	wmf a year ago \| parent \| next [-]
		Preemptive cynicism is even worse than regular cynicism.
	▲	dijksterhuis a year ago \| parent \| prev [-]
		if you personally want to play with it, go ahead. i think my personal feeling is that those sorts of comments you listed come out of the woodwork more when the comments section starts turning into an "oh man, this should be the standard for everyone" kind of discussion, which is never the case and is usually the point of those kinds of replies. at least they are when i reply with those kinds of comments anyway

▲

kd913 a year ago | parent | prev | next [-]

What is being asked for already exists? It is called Onload.

https://github.com/Xilinx-CNS/onload

▲

a-dub a year ago | parent [-]

it is my understanding that io_uring is the generalized open source implementation of this, although i do not think it bypasses the kernel fib trie like openonload does...

▲

gpderetta a year ago | parent [-]

Aside from Onload being open source, not really. AF_XDP is the generalized, hardware-agnostic version of kernel bypass.

In addition to bypass, Onload also provides a full user-space TCP/IP stack and non-intrusive support for existing binaries using the standard BSD socket interface (incidentally, Onload also supports XDP now).

io_uring is really for asynchronous communication with the kernel.

	▲	a-dub a year ago \| parent [-]
		interesting, didn't know that the networking stack had ring buffer infrastructure as well. (i don't think this af_xdp stuff existed when i was in this world) the fib trie is the core of the ip stack - i was using it as proxy for total ip stack bypass.

▲

secondcoming a year ago | parent | prev | next [-]

I looked at using DPDK on some of our GCP instances but it requires setting up a second VPC, which was one hurdle too much.

I’m hoping that io_uring makes all of this unnecessary anyway.

I recall reading a paper where someone noticed that for every packet the Linux kernel receives it has to check if any application has opened a raw socket. Raw sockets are initially needed to allow DHCP to work, so once your machine has been assigned an IP address you can (probably) turn this service off and so give the kernel less work to do. (My memory of the exact details may be sketchy).

▲

Polizeiposaune a year ago | parent | next [-]

DHCP issues address leases, not permanent assignments; leases have an expiration time (and earlier suggested renewal/rebind times). So the DHCP client must periodically renew -- if the tenant doesn't renew (perhaps because the DHCP client has been disabled), the DHCP service may lease the address to another tenant.

If the DHCP server hasn't moved to a new address this renewal can be done over unicast using the leased address - however, if the client doesn't receive a response from the server the client state machine will eventually discard the leased address and fall back to broadcast with an all-zeros source address (which is presumably what requires a raw socket).

The DHCP client implementation in question likely keeps the raw socket open for potential future use in this case. A client might be able to close the raw socket and reopen it later (but security folks might also want it to drop the privilege required to reopen the raw socket, and it might be hard to have an ironclad guarantee that the raw socket can be reopened later on a machine that's short on free kernel memory..).
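The renewal/rebind sequence described above can be sketched as a small timeline. The 0.5 and 0.875 fractions are the RFC 2131 defaults for T1 and T2 (a server may override them); the function names and the 24-hour lease are made up for illustration.

```python
def dhcp_timers(lease_s: float) -> dict[str, float]:
    """Default renewal (T1) and rebinding (T2) times per RFC 2131."""
    return {
        "renew_t1": lease_s * 0.5,
        "rebind_t2": lease_s * 0.875,
        "expire": lease_s,
    }


def dhcp_state(elapsed_s: float, lease_s: float) -> str:
    """Client state at a given time since the lease was granted, assuming no replies."""
    t = dhcp_timers(lease_s)
    if elapsed_s < t["renew_t1"]:
        return "BOUND"       # address in use, nothing to send
    if elapsed_s < t["rebind_t2"]:
        return "RENEWING"    # unicast renewal to the server that issued the lease
    if elapsed_s < t["expire"]:
        return "REBINDING"   # broadcast renewal to any server
    return "INIT"            # lease expired: drop the address, start over by broadcast


LEASE_S = 24 * 3600  # HYPOTHETICAL 24-hour lease

timers = dhcp_timers(LEASE_S)
print(timers)
for hours in (1, 13, 22, 25):
    print(f"{hours:>2} h: {dhcp_state(hours * 3600, LEASE_S)}")
```

The last state is the one relevant to the raw-socket question: once the lease is gone the client has no address of its own to send from.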

	▲	secondcoming a year ago \| parent [-]
		Not on GCP's GCE at least

▲

Matthias247 a year ago | parent | prev [-]

io_uring reduces the overhead of system calls - but it doesn't do anything to reduce the overhead of the actual networking stack.

If your send/receive calls spend most CPU time going through the routing/fragmentation/filter/BPF/etc. path in the networking stack, then uring (or other APIs which just reduce the system call overhead, like sendmmsg/recvmmsg for UDP) might only make a small difference. Source: Lots of profiling while implementing QUIC libraries.
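A toy cost model of that point, in the spirit of Amdahl's law: batching amortises the per-call overhead but leaves the per-packet stack work untouched. The microsecond figures are entirely hypothetical, not measurements from any profile.

```python
def per_packet_cost_us(syscall_us: float, stack_us: float, batch: int = 1) -> float:
    """Syscall overhead is shared across the batch; stack work is paid per packet."""
    return syscall_us / batch + stack_us


# HYPOTHETICAL costs: 0.5 us of syscall overhead, 2.0 us of network-stack work.
SYSCALL_US, STACK_US = 0.5, 2.0

baseline = per_packet_cost_us(SYSCALL_US, STACK_US)
batched = per_packet_cost_us(SYSCALL_US, STACK_US, batch=32)
speedup = baseline / batched

# Even with the syscall cost driven to zero, the stack work remains.
best_case_speedup = baseline / STACK_US

print(f"baseline {baseline:.3f} us, batched {batched:.3f} us, speedup {speedup:.2f}x")
print(f"ceiling with zero syscall cost: {best_case_speedup:.2f}x")
```

When the stack dominates like this, no amount of syscall batching gets past the ceiling; only skipping the stack itself moves it.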

An alternative to DPDK that allows bypassing the kernel networking stack would be AF_XDP.

▲

r00tbeer a year ago | parent | prev | next [-]

See https://irenezhang.net/papers/demikernel-sosp21.pdf for a more thorough paper on the Demikernel from 2021. There are some great ideas for improving the kernel interface while still allowing efficient DPDK-style pipelines.

▲

crest a year ago | parent | prev | next [-]

For such an interface to be feasible to support in common open source infrastructure it needs a pure software implementation for testing and development purposes. Even better would be something along the lines of coz that can also model performance by throttling down everything else proportionally.

▲

Gollapalli a year ago | parent | prev | next [-]

This is great! I think that there are a lot of latency sensitive applications which really do need to spare the kernel latency.
