aegis_camera 4 hours ago

We implemented two techniques to run massive 100B+ parameter MoE models natively on the M5 Pro 64GB MacBook Pro:

TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime, completely eliminating Python overhead.

SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or Watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.

By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon.

Also tested Qwen 4B on an iPhone 13 Pro.

Code and implementation details: https://github.com/SharpAI/SwiftLM

anemll 2 hours ago | parent | next [-]

Check it out, you might be able to speed it up using this https://github.com/Anemll/anemll-flash-mlx https://x.com/anemll/status/2038684375425200360

aegis_camera an hour ago | parent [-]

Thanks. Pure Swift was the design idea, and since I found nothing I could use for my project https://www.sharpai.org, I created a Swift version. Python is too heavy to ship with an application. A user mentioned they wanted to use MLX, so I've been working on this for 1-2 weeks of bug fixing and testing; then TurboQuant was proposed and I did a quick integration. My 64GB M5 Pro was already good enough for my local security task, and now it can also run on an M1/M2 Mini with 8GB of memory.

altruios 3 hours ago | parent | prev [-]

What tokens/s are you getting with a 122B MoE model in this setup? I didn't see any numbers in the benchmarks section of the README.

aegis_camera an hour ago | parent | next [-]

https://www.sharpai.org/benchmark/ The MLX part is what we've done with SwiftLM. The local results are still being verified; more details are on the way.

aegis_camera 2 hours ago | parent | prev | next [-]

I'll add more details. We just wired up the pipeline on both macOS and iOS.

gigatexal 2 hours ago | parent | prev [-]

Yeah, this I'd like to see added to the README.