Show HN: WayInfer – Native GGUF engine that runs models larger than your RAM
1 point by ahmedm24 6 hours ago

We built a native inference engine that runs quantized LLMs directly from SSD using memory-mapped I/O. The model never fully loads into RAM — the OS pages weights on demand as each layer executes.
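The engine itself is native code, but the core loading trick can be sketched with Python's stdlib `mmap` module. The helper names and the `.gguf` path below are illustrative, not WayInfer's actual API; the point is that mapping is instant regardless of file size, and bytes only leave the SSD when a page is touched:

```python
import mmap, os, tempfile

def open_weights(path):
    """Map a file read-only; returns the mmap object. No weight bytes
    are read from disk yet -- the mapping is just address-space setup."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def read_tensor_block(mm, offset, nbytes):
    # This slice is what faults pages in: the OS pulls only the touched
    # pages from disk into the page cache, on demand.
    return mm[offset:offset + nbytes]

# Tiny demo with a fake 'model' file (a real GGUF file starts with b"GGUF").
tmp = tempfile.NamedTemporaryFile(suffix=".gguf", delete=False)
tmp.write(b"GGUF" + bytes(range(16)))
tmp.close()

mm = open_weights(tmp.name)
magic = mm[:4]
block = read_tensor_block(mm, 4, 4)
mm.close()
os.unlink(tmp.name)
```

This is also why load time is near-constant: `mmap()` does no I/O up front, so an 80GB file "loads" as fast as an 80MB one.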

*What it does:*
- Mixtral 8x22B (80GB, 141B params) runs on a machine with 48GB RAM
- Model loads in 0.3 seconds (vs 190s with llama.cpp)
- Produces correct output: "What is 2+2?" → "The sum of 2 and 2 is 4."
- Zero dependencies — custom tensor engine, custom GGUF parser, no ggml/llama.cpp

*How it works:*
- `mmap()` the GGUF file; the OS handles SSD→RAM paging transparently
- Quantize the input to Q8_K, then compute dot products directly against Q4_K/Q5_K/Q6_K weights in the quantized domain — no dequantization to float32
- AVX2 SIMD plus 8-thread parallel matvec
- For MoE models, only 2 of 8 experts are active per token, so most weights stay cold on disk
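To make the quantized-domain dot product concrete, here is a simplified sketch. Real K-quants use 256-element superblocks with nested sub-scales; this uses a flat Q8_0-style scheme (int8 values plus one float scale per 32-value block, a deliberate simplification) to show the principle: multiply-accumulate in integers inside each block, then apply the float scales once per block at the end.

```python
BLOCK = 32  # simplified block size; real K-quants use 256-element superblocks

def quantize_q8(values):
    """Quantize a float vector into (scale, int8-list) blocks."""
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = amax / 127.0
        q = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, q))
    return blocks

def dot_quantized(a_blocks, b_blocks):
    total = 0.0
    for (sa, qa), (sb, qb) in zip(a_blocks, b_blocks):
        # Integer multiply-accumulate inside the block...
        isum = sum(x * y for x, y in zip(qa, qb))
        # ...then one late conversion to float per block.
        total += sa * sb * isum
    return total

# Demo: quantized self-dot vs. exact float dot (values are arbitrary).
a = [float(i % 7 - 3) for i in range(64)]
qa = quantize_q8(a)
approx = dot_quantized(qa, qa)
exact = sum(x * x for x in a)
```

In the native engine the inner integer loop is what maps onto AVX2 lanes; keeping it in integers until the per-block scale is exactly what avoids the float32 dequantization pass.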

*The hard part we solved:* GGUF models are calibrated for a specific dot product computation path (ggml's "quantize input → integer multiply-accumulate → late float conversion"). If you naively dequantize weights to float32 and do a standard dot product, the per-operation error is tiny (~0.001%) but compounds across 56 transformer layers into completely wrong output. We had to reverse-engineer and match ggml's exact scalar computation — block-level integer accumulation with 8-lane parallel reduction — to get correct results.
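A toy illustration of why the compounding matters (all numbers here are made up for the demo, not measured from WayInfer): a per-operation relative error compounds through the layers, and the final logits feed an argmax, so if two candidate tokens score nearly the same, even a small accumulated drift picks a different token, after which the autoregressive continuation diverges completely.

```python
def drift_after_layers(eps, layers):
    # Worst-case multiplicative accumulation of relative error.
    return (1 + eps) ** layers - 1

def pick_token(logits):
    # Greedy decoding: the argmax decides the next token.
    return max(range(len(logits)), key=lambda i: logits[i])

eps = 1e-5      # ~0.001% per operation, as in the post
layers = 56
drift = drift_after_layers(eps, layers)

# Two hypothetical near-tied logits; drift pushes them in opposite
# directions (worst case) and flips the argmax.
clean = [4.0000, 3.9998]
drifted = [clean[0] * (1 - drift), clean[1] * (1 + drift)]
```

One flipped token is enough: every subsequent token is conditioned on it, which is how a ~0.001% numeric mismatch turns into "completely wrong output."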

*What it doesn't do (yet):*
- Speed: ~0.08 tok/s on the 80GB model (CPU-only, no GPU offload)
- No interactive chat UI
- Only K-quant GGUF formats (Q4_K_M, Q5_K_M, Q6_K), which cover ~90% of models on HuggingFace
- Windows only (Linux stubs exist but are untested)

The architecture comes from WayOS (https://github.com/cloudlinqed/WayOS), my work-in-progress AI-first OS that treats SSD/RAM/VRAM as a unified memory hierarchy.

GitHub: https://github.com/cloudlinqed/WayInfer