quotemstr 5 hours ago

I've never understood the fascination some people have with mmap. Memory-mapped file IO is just a RAM cache combined with a hidden system call (a page fault) to fill the cache. You can do the same thing yourself by using O_DIRECT to fill regular anonymous memory. If you're feeling social, you can fill a mapped and shared memfd.
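A minimal sketch of the "fill anonymous memory with O_DIRECT" idea, for illustration only. The helper name `read_direct` and the 4096-byte `BLOCK` alignment are assumptions, not something from the comment. Some filesystems reject O_DIRECT with EINVAL, so the sketch falls back to an ordinary buffered read in that case and reports which path it took.

```python
import errno
import mmap
import os

# Assumed alignment unit. A real program should query the device's
# logical block size instead of hard-coding 4096.
BLOCK = 4096


def read_direct(path, length, offset=0):
    """Read `length` bytes at `offset` into page-aligned anonymous memory.

    Returns (buffer, bytes_read, used_direct).
    """
    if length <= 0 or length % BLOCK or offset % BLOCK:
        raise ValueError("length and offset must be positive multiples of BLOCK")

    # An anonymous mapping is page-aligned, which satisfies O_DIRECT's
    # buffer alignment requirement.
    buf = mmap.mmap(-1, length)

    attempts = ((os.O_RDONLY | os.O_DIRECT, True), (os.O_RDONLY, False))
    for flags, direct in attempts:
        try:
            fd = os.open(path, flags)
        except OSError as exc:
            if direct and exc.errno == errno.EINVAL:
                continue  # filesystem refuses O_DIRECT; try buffered I/O
            raise
        try:
            nread = os.preadv(fd, [buf], offset)
        except OSError as exc:
            if direct and exc.errno == errno.EINVAL:
                continue  # direct read rejected; try buffered I/O
            raise
        finally:
            os.close(fd)
        return buf, nread, direct
    raise RuntimeError("unreachable: the buffered attempt returns or raises")
```

Here an I/O failure surfaces as an `OSError` from `os.preadv` at a point the caller chose, rather than as a fault on some later memory access.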

You can seal memfds too, which makes a "read-only" mode easy to implement: just map your memfd for write, apply F_SEAL_FUTURE_WRITE, and share the memfd with anyone you want to have read-only access.

By doing your own O_DIRECT IO instead of relying on the kernel's defaults, you get a lot more control. You choose how much readahead to do; you choose your read-cluster size. You choose your cache eviction strategy. You choose when to write back.

BTW: O_DIRECT can also be done asynchronously using aio or io_uring. There's no such thing as an asynchronous page fault. And IO errors? Would you rather deal with EIO or SIGBUS?

Why would you want the kernel to do these things for you? It'll do a worse job: it has less information than you do and has to use blunt heuristics that work sort-of-good-enough for the whole world, not just your program.

And it's not any faster either. O_DIRECT is DMA. A page cache fill is also DMA. It's the same operation, spelled differently.

wmanley 4 hours ago

I use mmap with my SQLite database[1] because I have many concurrent SQLite connections (one per concurrent HTTP request) and I don't want each connection to have its own 2MB cache[2]. It's better that all the connections simply share the page cache.

[1]: https://sqlite.org/pragma.html#pragma_mmap_size

[2]: https://sqlite.org/pragma.html#pragma_cache_size
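A small sketch of that setup using Python's `sqlite3` module. The function name, the 256 MiB mapping limit, and the 16-page private cache are illustrative assumptions, not values from the comment. SQLite may cap or ignore `mmap_size` depending on how it was compiled, so the sketch returns whatever limit the pragma reports.

```python
import sqlite3


def open_connection(path, mmap_bytes=256 * 1024 * 1024, cache_pages=16):
    """Open a connection that leans on the OS page cache via mmap.

    Returns (connection, effective_mmap_size_in_bytes).
    """
    conn = sqlite3.connect(path)

    # Shrink the per-connection page cache (a positive value is a page count).
    conn.execute(f"PRAGMA cache_size = {int(cache_pages)}")

    # Ask SQLite to memory-map up to mmap_bytes of the database file.
    # The pragma reports the limit actually in effect; it may be lower than
    # requested, or absent if the build has mmap support compiled out.
    row = conn.execute(f"PRAGMA mmap_size = {int(mmap_bytes)}").fetchone()
    effective = row[0] if row else 0

    return conn, effective
```

With one such connection per request, pages read through the mapping live in the kernel page cache and are shared by all connections instead of being copied into each connection's private cache.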

teravor 4 hours ago

With mmap you also don't have to worry about committing too much system memory: if another application needs it, the kernel will start evicting your cache.

quotemstr 4 hours ago

You're right about that.

Linux needs a way for userspace processes to participate in the kernel's shrinker system for reclaiming memory under pressure. Watching memory PSI is too coarse. MADV_FREE is too complicated and indiscriminate. You could imagine a notification FD, but then you've just reinvented PSI. You could imagine a synchronous signal, but everyone hates signals and won't couple any new functionality to them.

Shrinker-BPF attached to a memfd perhaps? A BPF shrinker could not only choose which pages to evict in a non-stupid way, but could notify userspace in some sane manner (e.g. setting a bitmask somewhere) that it's done so.

(Zero-fill as "notification" is insane and doesn't actually work because zero is a perfectly valid value in a lot of contexts.)

bagxrvxpepzn 5 hours ago

> I've never understood the fascination some people have with mmap.

Uncommonly used system calls give user-space programmers the sensation of learning something.

> Why would you want the kernel to do these things for you? It'll do a worse job: it has less information than you do and has to use blunt heuristics that work sort-of-good-enough for the whole world, not just your program.

Yes, you're opting into non-determinism you don't control. When resources get constrained, not everything fits in memory, and someone asks you why the database sucks, all you'll be able to do is shrug. Anyone who builds critical systems would never rely on the kernel making decisions like this. Don't use LMDB for anything that matters.

jnwatson 3 hours ago

You're already depending on the OS for many other things. Depending on it for page caching is just one more thing.

bagxrvxpepzn 3 hours ago

This level of reasoning is insufficient when building reliable systems. The consequences of depending on the OS for page caching are different from the consequences of depending on it for device drivers, file systems, or scheduling.

ok123456 5 hours ago

The OS handles all of that transparently, without requiring any additional code. I think that is the draw.

quotemstr 5 hours ago

And that's adequate for casual programs. LMDB is big and serious enough to warrant the extra complexity (which, to be fair, is significant) of userspace buffer management. LMDB does the work once and all users benefit.