| ▲ | anshulbasia27 2 days ago |
| OS paging would be significantly worse here. The kernel's page fault handler is reactive — it doesn't know
you're about to read layer 47's FFN weights, so it can't prefetch. You stall on every fault, wait for the
4KB/16KB page to load, then resume. With 80 layers of dense FFN streaming, that's thousands of cold faults per
token. What makes this approach faster is that the model's access pattern is completely deterministic during
inference. You know exactly which tensors are needed next because transformer layers execute sequentially. So
you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal.
The OS page cache can't do that — it has no concept of "layer N+1 comes after layer N."
For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one,
then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than
expert 7. The neuron cache here is basically a domain-specific replacement policy.
|
|
| ▲ | zozbot234 2 days ago | parent | next [-] |
| ▲ | zozbot234 2 days ago | parent | next [-] |
| > The kernel's page fault handler is reactive — it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch.
man 2 madvise |
| |
| ▲ | astrange a day ago | parent [-] |
| That works for readahead, but it's not good for random access. readv, aio, and dispatch_io are better there. |
| ▲ | zozbot234 a day ago | parent [-] |
| That comparison is a bit apples and oranges (no pun intended!). madvise is about providing hints to the kernel to tune the page cache and readahead (including possibly disabling readahead altogether). It's not about performing reads into private memory buffers, which is where the options you mentioned actually fit in. |
| ▲ | astrange 17 hours ago | parent [-] |
| Triggering reads is also how you get pages into the page cache, so it helps to know how to do it. |
|
| ▲ | EnPissant 2 days ago | parent | prev | next [-] |
| That assumes you have significant work to do between fetches (so you can prefetch while using the current data). With LLM decode you don't. |
|
| ▲ | 2 days ago | parent | prev [-] |
| [deleted] |