aegis_camera 4 hours ago
We implemented two techniques to run massive 100B+ parameter MoE models natively on the M5 Pro 64GB MacBook Pro:

TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime and completely eliminates Python overhead.

SSD expert streaming: To fit a 122B-parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or watchdog kernel kills, the full ~60 GB weight file stays on NVMe. Only the pages for the top-k active experts are streamed to the GPU per forward pass, at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.

Combining these two approaches lets us comfortably run massive models in memory-constrained environments on Apple Silicon. We also tested Qwen 4B on an iPhone 13 Pro.

Code and implementation details: https://github.com/SharpAI/SwiftLM
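For anyone curious how the SSD streaming side can work, here is a minimal C++ sketch of the general technique: mmap the weight file read-only, hand out zero-copy views of individual expert blocks, and use madvise to hint the kernel to prefetch the pages you are about to touch, so the OS page cache keeps hot experts resident. This is an illustration of the idea only; the struct and function names are hypothetical and not taken from SwiftLM, and it assumes experts are stored as contiguous fixed-size blocks.

```cpp
// Sketch of page-cache-backed expert streaming (hypothetical names,
// not the SwiftLM API). The full weight file stays on disk; only the
// pages of the experts actually requested are faulted in.
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

struct ExpertFile {
    const uint8_t* base = nullptr;
    size_t file_size = 0;
    size_t expert_bytes = 0;  // bytes per expert block (assumed fixed-size)

    bool open_map(const char* path, size_t per_expert) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return false;
        file_size = static_cast<size_t>(::lseek(fd, 0, SEEK_END));
        void* p = ::mmap(nullptr, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping survives closing the descriptor
        if (p == MAP_FAILED) return false;
        base = static_cast<const uint8_t*>(p);
        expert_bytes = per_expert;
        return true;
    }

    // Zero-copy view of one expert; madvise hints the kernel to prefetch
    // its pages before they are read (e.g. for a GPU upload).
    const uint8_t* expert(size_t idx) const {
        const uint8_t* p = base + idx * expert_bytes;
        // madvise requires a page-aligned start address, so round down.
        long page = ::sysconf(_SC_PAGESIZE);
        uintptr_t start =
            reinterpret_cast<uintptr_t>(p) & ~static_cast<uintptr_t>(page - 1);
        size_t len = expert_bytes + (reinterpret_cast<uintptr_t>(p) - start);
        ::madvise(reinterpret_cast<void*>(start), len, MADV_WILLNEED);
        return p;
    }

    ~ExpertFile() {
        if (base) ::munmap(const_cast<uint8_t*>(base), file_size);
    }
};
```

Because the mapping is read-only and shared with the page cache, re-routing to a recently used expert costs no extra I/O: its pages are still cached, which is presumably what gives the "hot-expert reuse" behavior described above.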
anemll 2 hours ago
Check this out; you might be able to speed it up using it: https://github.com/Anemll/anemll-flash-mlx https://x.com/anemll/status/2038684375425200360
altruios 3 hours ago
What tokens/s are you getting with a 122B MoE model in this setup? I didn't see any numbers in the README's benchmarks section.