drewg123 · 19 hours ago
I come from a very different world (optimizing the FreeBSD kernel for the Netflix CDN, running on bare metal) but performance leaps like this are fascinating to me. One of the things that struck me when reading this with only general knowledge of the Linux kernel is: What makes things so terrible? Is iptables really that bad? Is something serialized to a single core somewhere in the other 3 scenarios? Is the CPU at 100% in all cases? Is this TCP or UDP traffic? How many threads is iperf using? It would be cool to see the CPU utilization of all 4 scenarios, along with CPU flamegraphs.
toprerules · 19 hours ago
In the case of XDP, the reason it's so much faster is that it requires zero allocations in the most common case. The DMA buffers are recycled in a page pool that has already allocated and mapped at least a queue depth's worth of buffers for each hardware queue. XDP simply runs on the raw buffer data, then tells the driver what the user wants to do with the buffer. If all you are doing is rewriting an IP address, this is incredibly fast. In the non-XDP case (eBPF on TC) you have to allocate an sk_buff and initialize it. This is very expensive: there's a ton of accounting in the struct itself, plus components that track every sk_buff. Then there are the various CPU-bound routing layers. Overall the network core of Linux is very efficient, and the actual page pool buffer isn't copied until the user reads data. But there are a million features the stack needs to support, and all of them cost efficiency.
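To make "running on the raw buffer data" concrete, here is a minimal sketch of the rewrite-an-IP case. It is not the program from the article; the 192.0.2.1 address and the XDP_TX verdict are placeholders picked for illustration. The only work is the verifier-mandated bounds checks, an in-place rewrite of the destination address with an incremental checksum fixup, and a single return code telling the driver what to do with the buffer. Nothing allocates; the program only touches bytes the NIC already DMA'd into the page-pool buffer.

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int rewrite_dst(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Bounds checks the verifier insists on before any access. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;

        __u32 old_daddr = bpf_ntohl(ip->daddr);
        __u32 new_daddr = 0xC0000201;            /* 192.0.2.1, a placeholder */

        /* Incremental header checksum fixup (RFC 1624, eqn. 3):
         * HC' = ~(~HC + ~m + m'), per 16-bit word, carries folded back in. */
        __u32 sum = (__u16)~bpf_ntohs(ip->check);
        sum += (__u16)~(old_daddr >> 16) + (__u16)~(old_daddr & 0xffff);
        sum += (new_daddr >> 16) + (new_daddr & 0xffff);
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);

        ip->check = bpf_htons((__u16)~sum);
        ip->daddr = bpf_htonl(new_daddr);

        /* Hand the same buffer straight back to the driver for transmit. */
        return XDP_TX;
    }

    char _license[] SEC("license") = "GPL";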
shivanshvij · 19 hours ago
As far as we can tell, it's a mixture of a lot of things. One of the questions I got asked was how useful this is if you have a smaller performance requirement than 200Gbps (or, maybe a better way to put it, what if your host is small and can only do 10Gbps anyway). You'll have to wait for the follow-up post with the CNI plugin for the full benchmark you can reproduce yourself, but on a 16-core EC2 instance with a 10Gbps connection, iptables couldn't do more than 5Gbps of throughput (TCP!), whereas XDP was again able to do 9.84Gbps on average. Furthermore, running bidirectional iperf3 tests on the larger hosts shows us that both ingress and egress throughput increase when we swap out iptables on just the egress path. This is all to say, our current assumption is that when the CPU is thrashed by iperf3, the RSS queues, the Linux kernel's ksoftirqd threads, etc. all at once, performance collapses. XDP moves some of the work out of the kernel's regular networking path, and at the same time the packet is only processed through the kernel stack half as much as without XDP (only on the path before or after the veth). It really is all CPU usage in the end, as far as I can tell. It's not like our checksumming approach is any better than what the kernel already does.
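To illustrate the "half the stack" point (this is only a sketch of the general technique, not a claim about how the actual plugin is wired), an XDP program on the host side of a veth pair could hand egress frames straight to another interface with bpf_redirect(), so they never become sk_buffs on that leg of the path. TARGET_IFINDEX is a made-up constant; a real program would look the target up in a BPF map populated from user space.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define TARGET_IFINDEX 4   /* hypothetical ifindex of the physical NIC */

    SEC("xdp")
    int veth_egress_redirect(struct xdp_md *ctx)
    {
        /* Queue the frame for transmission on the target interface.
         * bpf_redirect() returns XDP_REDIRECT on success, so the packet
         * is handed off without ever being wrapped in an sk_buff here. */
        return bpf_redirect(TARGET_IFINDEX, 0);
    }

    char _license[] SEC("license") = "GPL";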
tuetuopay · 18 hours ago
The kernel will allocate, merge packets into skbs if needed, extract data, and do quite a lot more. XDP runs as early as possible in the datapath: pretty much all drivers have to do is call the XDP code when they receive an IRQ from the NIC. You'll bypass a memory copy (ring buffer -> kernel memory), allocations (skb), parsing (IPs and such), firewalling, checking whether the packet is local, checksum validation; the list goes on. The following diagram helps in seeing everything that happens: https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilte... (yes, XDP is the leftmost step, literally right after "card dma'd packet in memory")
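A bare-bones sketch of just how early that hook sits (nothing here is specific to the article; it is the generic skeleton): the driver runs this against the raw ring-buffer frame and acts on the return code, and none of the steps listed above has happened yet.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int earliest_hook(struct xdp_md *ctx)
    {
        /* At this point there is no skb, no conntrack entry, no routing
         * decision and no checksum validation. The return code is the only
         * thing the driver needs:
         *   XDP_DROP     - recycle the buffer immediately
         *   XDP_PASS     - build an skb and continue into the normal stack
         *   XDP_TX       - transmit back out the same NIC
         *   XDP_REDIRECT - hand off to another NIC, CPU, or AF_XDP socket
         */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";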
tux1968 · 19 hours ago
It's also a bit depressing that everyone is still using the slower iptables, when nftables has been in the kernel for over a decade.