| ▲ | topspin 2 days ago |
| In your high level "You might not want to use it if" points, you mention Docker but not why, and that's odd. I happen to know why: io_uring syscalls are blocked by default in Docker, because io_uring is a large surface area for attacks, and this has proven to be a real problem in practice. Others won't know this, however. They also won't know that io_uring is similarly blocked in widely used cloud sandboxes, Android, and elsewhere. Seems like a fine place to point this stuff out: anyone considering io_uring would want to know about these issues. |
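For readers who want to check their own environment, here is a minimal probe sketch rather than a definitive test. It calls `io_uring_setup` directly through `ctypes`. It assumes the syscall number is 425 (true on x86_64 and aarch64 as far as I know) and that `struct io_uring_params` is 120 bytes. The `classify` and `probe_io_uring` helpers are hypothetical names. Mapping `EPERM` to "blocked" is also an assumption about how a seccomp profile or sysctl typically rejects the call.

```python
import ctypes
import errno
import os
import sys

# Assumption: 425 is the io_uring_setup syscall number on x86_64 and aarch64.
SYS_IO_URING_SETUP = 425
# Assumption: sizeof(struct io_uring_params) is 120 bytes.
PARAMS_SIZE = 120


def classify(fd, err):
    """Map the raw result of io_uring_setup to a short verdict."""
    if fd >= 0:
        return "available"
    if err == errno.EPERM:
        return "blocked"        # typical of a seccomp profile or sysctl denial
    if err == errno.ENOSYS:
        return "unsupported"    # kernel or sandbox does not implement the call
    return "error: " + errno.errorcode.get(err, str(err))


def probe_io_uring():
    """Try to create a one-entry ring and report what happened."""
    if not sys.platform.startswith("linux"):
        return "not-linux"
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    params = ctypes.create_string_buffer(PARAMS_SIZE)  # zero-filled struct
    fd = libc.syscall(ctypes.c_long(SYS_IO_URING_SETUP), ctypes.c_uint(1), params)
    err = ctypes.get_errno()
    if fd >= 0:
        os.close(fd)  # we only wanted to know whether setup succeeds
    return classify(fd, err)


if __name__ == "__main__":
    print(probe_io_uring())
```

Running this once on the host and once inside a container should show whether the container's policy is what rejects the call.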
|
| ▲ | melhindi 2 days ago | parent | next [-] |
| Very good point! You’re absolutely right: The fact that io_uring is blocked by default in Docker and other sandboxes due to security concerns is important context, and we should have mentioned it explicitly there. We'll update the post, and happy to incorporate any other caveats you think are worth calling out. |
| |
|
| ▲ | abc123def456 18 hours ago | parent | prev | next [-] |
| Do you know if this still applies if you run a docker container with host networking enabled? |
|
| ▲ | hayd 2 days ago | parent | prev [-] |
| Is this something likely to ever change? |
| |
| ▲ | topspin 2 days ago | parent | next [-] | | I believe it's possible, but that it's a hard problem requiring great effort. I believe this is an opportunity to apply formal methods à la seL4, that nothing less will be sufficient, and that the value of io_uring is great enough to justify it. That will take a lot of talent and hours. I admire io_uring. I appreciate the fact that it exists and continues despite the security problems; evidence that security "concerns" don't (yet) have a veto over all things Linux. The design isn't novel. High performance hardware (NICs, HBAs, codecs, etc.) has used similar techniques for a long time. io_uring only brings this to user space and generalizes it. I imagine an OS and hardware that fully inculcate the pattern, obviating the need for context switches, interrupts, blocking and other conventional approaches we've slouched into since the inception of computing. | | |
| ▲ | quotemstr 2 days ago | parent [-] | | Alternatively, it requires cloud providers and such losing business if they refuse to support the latest features. The "surface area" argument against io_uring can apply to literally any innovation. Over on LWN, there's an article on path traversal difficulties that mentions how, because openat2(2) is often banned as inconvenient to whitelist using seccomp, people have to work around path traversal bugs using fiddly, manual, and slow element-by-element path traversal in user space. Ridiculous security theater. A new system call had a vulnerability in 2010 and so we're never able to take practical advantage of new kernel features ever? (It doesn't help that gvisor refuses to acknowledge the modern world.) Great example of descending into a shitty equilibrium because the great costs of a bad policy are diffuse but the slight benefits are concentrated. The only effective lever is commercial pressure. All the formal methods in the world won't help when the incentive structure reinforces technical obstinacy. |
| |
| ▲ | charcircuit 2 days ago | parent | prev | next [-] | | It already did with the io_uring worker rewrite in 5.12 (2021) which made it much safer. https://github.com/axboe/liburing/discussions/1047 | | |
| ▲ | topspin a day ago | parent [-] | | I can't agree with this. There is ample evidence of serious flaws since 2021. I hate that. I wish it weren't true. But an objective analysis of the record demands that view. Here is a fun one from September (CVE-2025-39816): "io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths." That is an attacker's wet dream right there: bump the length and exfiltrate sensitive data. And it wasn't just some short-lived "Linus's branch" work no one actually ran: it existed for a time in, for example, Ubuntu 24.04 LTS (circa 2024 release date). I just cherry-picked that one from among many. |
| |
| ▲ | Asmod4n 2 days ago | parent | prev [-] | | It’s manageable with eBPF instead of seccomp so one has to adapt to that. Should be doable. | | |
| ▲ | georgyo 2 days ago | parent [-] | | Maybe not so doable. The whole point of io_uring is to reduce syscalls. So you end up with just three: io_uring_setup, io_uring_register, and io_uring_enter. There is now a memory buffer that both user space and the kernel are reading, and with that buffer you can _always_ do any syscall that io_uring supports. And things like strace, eBPF, and seccomp cannot see the actual syscalls that are being called in that memory buffer. And having something like seccomp or eBPF inspect the stream might slow it down enough to eat the performance gain. | | |
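The visibility problem described above can be illustrated with a toy model. This is pure Python and makes no real kernel calls. `DENIED_SYSCALLS`, `Sqe`, `run_direct`, and `run_via_ring` are all hypothetical names, and the opcode strings merely stand in for the kernel's `IORING_OP_*` values. The sketch assumes a filter that can only match on the name of the syscall being entered.

```python
from dataclasses import dataclass

# Toy policy in the spirit of a seccomp filter: keyed on syscall names only.
DENIED_SYSCALLS = {"openat", "connect"}


def filter_allows(syscall_name):
    """Return True if the toy filter lets this syscall through."""
    return syscall_name not in DENIED_SYSCALLS


@dataclass
class Sqe:
    opcode: str  # e.g. "OPENAT"; a plain string standing in for IORING_OP_*
    arg: str


def run_direct(syscall_name):
    """A classic syscall: the filter sees exactly what is being asked for."""
    return "allowed" if filter_allows(syscall_name) else "denied"


def run_via_ring(sqes):
    """The filter only sees io_uring_enter; opcodes sit in shared memory."""
    if not filter_allows("io_uring_enter"):
        return ["denied"] * len(sqes)
    return [f"{sqe.opcode} executed" for sqe in sqes]


if __name__ == "__main__":
    print(run_direct("openat"))                          # filtered by name
    print(run_via_ring([Sqe("OPENAT", "/etc/passwd")]))  # slips past the filter
```

In this model the only way to stop the ring-submitted open is to deny `io_uring_enter` itself, which is the all-or-nothing choice sandboxes end up making.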
| ▲ | to_ziegler 2 days ago | parent | next [-] | | There is some ongoing research on eBPF and io_uring that you might find interesting, e.g., RingGuard: Guarding io_uring with eBPF (https://dl.acm.org/doi/10.1145/3609021.3609304). | |
| ▲ | Asmod4n 2 days ago | parent | prev | next [-] | | Ain’t eBPF hooks there so you can limit what a cgroup/process can do, no matter what API it’s calling? Like disallowing opening files or connecting sockets altogether. | |
| ▲ | actionfromafar 2 days ago | parent | prev [-] | | So io_uring is like transactions in sql but for syscalls? | | |
| ▲ | topspin a day ago | parent [-] | | No. A batch of submission queue entries (SQEs) can be partially completed, whereas an ACID database transaction is all or nothing. The syscalls performed by SQEs have side effects that can't reasonably be undone. Failures of operations performed by SQEs don't stop or rollback anything. Think of io_uring as a pair of unidirectional pipes. You shove syscalls and (pointers to) data into one pipe and the results (asynchronously) gush out of the other pipe, errors and all. Each pipe is actually a separate block of memory shared between your process and the kernel: you scribble in one and read from the other, and the kernel does the opposite. |
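To make the "two pipes, no rollback" picture concrete, here is a toy simulation, not real io_uring. Two `deque`s stand in for the submission and completion rings, and a hypothetical `kernel_drain` function plays the kernel. Each entry is run independently and reports either a result or a negative errno, which mirrors how a CQE's `res` field reports failures. Unlike this sketch, the real kernel may complete entries out of order.

```python
import errno
import os
from collections import deque


def kernel_drain(sq, cq):
    """Toy 'kernel': run every queued entry and post one completion for each."""
    while sq:
        user_data, op = sq.popleft()
        try:
            res = op()
        except OSError as exc:
            res = -exc.errno  # failure is just another completion, nothing is undone
        cq.append((user_data, res))


def demo():
    sq, cq = deque(), deque()  # submission ring and completion ring
    r, w = os.pipe()
    try:
        sq.append(("write", lambda: os.write(w, b"hello")))
        sq.append(("bad-close", lambda: os.close(-1)))  # fails with EBADF
        sq.append(("read", lambda: len(os.read(r, 5))))
        kernel_drain(sq, cq)
    finally:
        os.close(r)
        os.close(w)
    return dict(cq)


if __name__ == "__main__":
    print(demo())
```

The middle entry fails, yet the write before it stays done and the read after it still runs. That is the partial completion a database transaction would not allow.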