codazoda 3 days ago:
Tell us more?

DiabloD3 3 days ago | parent:
Nothing really to say; it's just like everyone else's inference setup. Select a model that produces good results, has anywhere from 256k to 1M context (e.g. Qwen3-Coder can do 1M), is under one of the acceptable open-weights licenses, and run it in llama.cpp. llama.cpp can split a MoE model between its always-active layers and its expert layers, loading only the active ones into VRAM and leaving the rest of the card free for context.

With Qwen3-Coder-30B-A3B, I can use Unsloth's Q4_K_M, consume a mere 784MB of VRAM for the active layers, then 27648MB (KV cache) + 3096MB (context buffers) with the KV cache quantized to iq4_nl. That fits onto a single card with 32GB of VRAM, or slightly spills over on 24GB. Since I don't personally need that much context (I'm not pouring entire projects into it; I know people do this, and more data does not produce better results), I bump it down to 512k and fit the whole thing in 16.0GB, avoiding spill-over on my 24GB card. In the event I do need the full context, I'm always free to enable it.

I don't see a meaningful performance difference between running everything on the card and keeping the MoE experts in system RAM while the active layers sit in VRAM, so it's very much a worthwhile option for home inference.

Edit: For completeness' sake, 256k context with this configuration is 8.3GB total VRAM, making good inference on _very_ budget hardware absolutely possible.
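For reference, the invocation looks roughly like this (a sketch only: the GGUF filename is whatever your Unsloth download is called, and flag spellings vary between llama.cpp builds; newer builds also offer --cpu-moe / --n-cpu-moe as a shortcut for the tensor-override regex):

    # keep every layer on the GPU, then override the MoE expert tensors back to system RAM;
    # quantize the KV cache to iq4_nl (flash attention is needed for the quantized V cache)
    ./llama-server \
      -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
      -ngl 99 \
      -ot ".ffn_.*_exps.=CPU" \
      -c 524288 \
      -fa \
      --cache-type-k iq4_nl --cache-type-v iq4_nl

Swap -c 524288 for 1048576 or 262144 to hit the 1M and 256k figures above.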