quotemstr 3 hours ago
Linux is unusual among OS kernels in that direct system calls from arbitrary userspace code are supported and ABI-stable. This model has always been a terrible idea. It robs the system of an ability to intercept system calls in userspace before doing an expensive privilege-mode transition. If, instead, as on OpenBSD, the kernel enforced the rule that all system calls had to go through libc (or perhaps a big ntdll.dll-like VDSO), then the whole problem the linked article tries in vain to solve would disappear. If you wanted to hook a system call, you'd just change the libc/VDSO dispatch. No need to rewrite any instructions.

If I were Linus, I'd make a new rule: starting today, all new system calls must go through VDSO. No exceptions. SYSCALL from anywhere else? SIGKILL. This way, you can just LD_PRELOAD in front of the VDSO and system call interception in userspace Just Works.
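To make the hooking argument concrete, here is a minimal Python sketch of a single userspace dispatch point. It is only a toy model, not how libc or the vDSO is actually implemented; `SyscallDispatch`, `install_hook`, `make_logging_hook`, and the `getpid` table entry are hypothetical names made up for illustration. The point it tries to show is that when every call funnels through one table, interception is a table update rather than instruction rewriting.

```python
import os
from typing import Any, Callable, Dict, List, Tuple


class SyscallDispatch:
    """Toy model of a single userspace entry point for system calls."""

    def __init__(self) -> None:
        # Hypothetical table: call name -> function that performs the call.
        self._table: Dict[str, Callable[..., Any]] = {"getpid": os.getpid}

    def call(self, name: str, *args: Any) -> Any:
        # Every call goes through this one lookup.
        return self._table[name](*args)

    def install_hook(self, name: str, wrapper: Callable) -> None:
        # Replace the table entry with a wrapper around the original.
        self._table[name] = wrapper(self._table[name])


def make_logging_hook(log: List[Tuple[str, tuple]], name: str) -> Callable:
    """Build a wrapper that records each call, then forwards it."""
    def wrapper(original: Callable[..., Any]) -> Callable[..., Any]:
        def hooked(*args: Any) -> Any:
            log.append((name, args))  # observe the call in userspace
            return original(*args)    # then forward to the original entry
        return hooked
    return wrapper


if __name__ == "__main__":
    dispatch = SyscallDispatch()
    seen: List[Tuple[str, tuple]] = []
    dispatch.install_hook("getpid", make_logging_hook(seen, "getpid"))
    print(dispatch.call("getpid"), seen)
```

Whether a real kernel could enforce such a single entry point without breaking existing binaries is a separate question that this sketch does not address.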
razighter777 24 minutes ago
Direct system calls are an amazing idea. The ntdll and BSD models are worse. The whole libc becomes a security boundary without the protection of kernel space. So much Windows malware and process tampering happens because you now have a library (ntdll), living entirely in userspace, that is given special privileges and so becomes a huge attack surface. Then you have to deal with breakage between the built-in libc versions and the kernel.

The syscall overhead isn't as large as you suppose; for workloads where it actually makes a difference, there are robust low-syscall paths for I/O- and latency-sensitive operations, with DPDK, io_uring, and futex being a few examples. And there are robust, performant methods on Linux for syscall interception and tracing: see seccomp unotify, BPF tracepoints, and ftrace.
yjftsjthsd-h 3 hours ago
> This model has always been a terrible idea. It robs the system of an ability to intercept system calls in userspace before doing an expensive privilege-mode transition.

This model has always been a trade-off. It has downsides, but it also has upsides, including an immense boost in flexibility; decoupling from any particular userspace is useful.

> This way, you can just LD_PRELOAD in front of the VDSO and system call interception in userspace Just Works.

Can you LD_PRELOAD in front of the vDSO? I was under the (possibly mistaken) impression that the kernel injects it directly.
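On the vDSO question, one way to check is to look at a process's own memory map. Below is a rough Python sketch, assuming the usual Linux `/proc/self/maps` line format; `find_vdso` is a hypothetical helper name. On Linux the kernel maps the `[vdso]` region itself at exec time and passes its address to the process in the auxiliary vector (`AT_SYSINFO_EHDR`), so it is not found through the dynamic linker's normal library search, which is where LD_PRELOAD takes effect.

```python
from typing import Optional, Tuple


def find_vdso(maps_text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the [vdso] mapping, or None if absent."""
    for line in maps_text.splitlines():
        fields = line.split()
        # Assumed format: address perms offset dev inode [pathname]
        if len(fields) >= 6 and fields[5] == "[vdso]":
            start, end = fields[0].split("-")
            return int(start, 16), int(end, 16)
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/maps") as f:
            print(find_vdso(f.read()))
    except FileNotFoundError:
        print("no /proc/self/maps here (probably not Linux)")
```

The mapping has no backing file path, only the `[vdso]` label, which fits the impression that the kernel injects it directly.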
throwaway7356 2 hours ago
> all system calls had to go through libc (or perhaps a big ntdll.dll-like

Which makes containers crap on Windows and *BSD, as they have to run the correct libc or equivalent. Thus you need to build a different container per OS version, which sucks compared to Linux.
freestanding 2 hours ago
That's why OpenBSD is inconvenient for development: it binds you to libc bloatware.
Gualdrapo 3 hours ago
> If I were Linus, I'd make a new rule

Or, you know, just propose your idea to him.