| ▲ | Kayou 3 hours ago | |
Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible; I was always restricting myself to smaller models than the VRAM I had.
| ▲ | segmondy 3 hours ago | parent | next [-] | |
llama.cpp is designed for partial offloading: the most important parts of the model are loaded into GPU VRAM and the rest into system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without anywhere near that much GPU VRAM.
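A rough sketch of the idea behind partial offloading: given a per-layer weight size and a VRAM budget, put as many layers as fit on the GPU and leave the rest in system RAM. The numbers and the `split_layers` helper below are illustrative assumptions, not llama.cpp's actual accounting (which also budgets for the KV cache, context size, etc.).

```python
# Illustrative sketch of llama.cpp-style partial offloading: layers that fit
# in the VRAM budget go to the GPU, the remainder stays in system RAM.
# All sizes here are made-up round numbers, not measurements.

def split_layers(n_layers: int, layer_size_gb: float,
                 vram_gb: float, reserve_gb: float = 1.0):
    """Return (gpu_layers, cpu_layers) for a given per-layer size and VRAM budget."""
    usable = max(vram_gb - reserve_gb, 0.0)   # keep headroom for KV cache etc.
    gpu_layers = min(n_layers, int(usable // layer_size_gb))
    return gpu_layers, n_layers - gpu_layers

# e.g. a ~20 GB quantized model with 40 layers (~0.5 GB each) on a 16 GB card:
gpu, cpu = split_layers(n_layers=40, layer_size_gb=0.5, vram_gb=16)
# 30 layers land on the GPU, 10 stay in system RAM
```

In llama.cpp this split is controlled by the `--n-gpu-layers` (`-ngl`) option; layers kept in system RAM run on the CPU, which is why inference still works, just slower.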
| ▲ | Koffiepoeder an hour ago | parent | prev | next [-] | |
The A3B part of the name stands for `Active 3B`: for each inference step, a core ~3B-parameter subset of the model is used together with other subparts chosen per task (MoE, mixture of experts). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the full 35B params in active RAM. Such models are therefore also sometimes called sparse models.
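The sparse-activation idea can be sketched with a toy top-k router (an illustration of MoE routing in general, not Qwen's actual architecture; all sizes are hypothetical): a gate scores every expert per token, only the k best actually run, so the active parameter count is a small fraction of the total.

```python
# Toy mixture-of-experts router: of N experts, only the top-k run per token,
# so active params per token are a fraction of the total. Sizes are made up.
import random

N_EXPERTS = 8
TOP_K = 2
PARAMS_PER_EXPERT = 4_000_000_000  # hypothetical

def route(token_scores):
    """Return the indices of the top-k highest-scoring experts for one token."""
    return sorted(range(len(token_scores)), key=lambda i: -token_scores[i])[:TOP_K]

scores = [random.random() for _ in range(N_EXPERTS)]
active = route(scores)                       # e.g. [5, 2]: only 2 experts run
active_params = TOP_K * PARAMS_PER_EXPERT    # 8B active
total_params = N_EXPERTS * PARAMS_PER_EXPERT # 32B total
```

The memory win comes from the same skew the comment describes: if your workload keeps hitting the same few experts, only those experts' weights need to be resident.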
| ▲ | Maxious 3 hours ago | parent | prev | next [-] | |
Yep. These mixture-of-experts models are well suited to paging in only the relevant weights for a given task: https://huggingface.co/blog/moe There are also experiments in removing or merging experts post-training to shrink models even further: https://bknyaz.github.io/blog/2026/moe/
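A minimal sketch of expert pruning in the spirit of those experiments (heavily simplified; the `prune_experts` helper and the calibration trace are hypothetical): log which experts the router actually selects on some calibration data, then keep only the most-used ones.

```python
# Simplified post-training expert pruning: count how often the router picked
# each expert on a calibration run, then keep only the top `keep` experts.
from collections import Counter

def prune_experts(routing_log, n_experts, keep):
    """routing_log: list of chosen expert indices; returns indices to keep."""
    counts = Counter(routing_log)  # experts never selected count as 0
    ranked = sorted(range(n_experts), key=lambda e: -counts[e])
    return sorted(ranked[:keep])

log = [0, 1, 1, 3, 3, 3, 5, 1]               # toy routing trace over 8 experts
kept = prune_experts(log, n_experts=8, keep=3)  # keeps experts 0, 1 and 3
```

Real pruning/merging work also has to repair the router and recover quality, but the core observation is the same: rarely-used experts contribute little and can be dropped or merged.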
| ▲ | nurettin an hour ago | parent | prev [-] | |
This is why they say "A3B": only 3B parameters are active at a time, limiting VRAM usage.