bigyabai 4 days ago
The model is 80B parameters, but only 3B are activated during inference. I'm running the old 2507 Qwen3 30B model on my 8 GB Nvidia card and get very usable performance.
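Rough numbers, where the quantization levels are my own assumption rather than anything specific to this model:

    # Back-of-envelope footprint for an 80B-total / 3B-active MoE model.
    def footprint_gb(params_billions: float, bits_per_weight: int) -> float:
        # 1e9 params * (bits/8) bytes per param / 1e9 bytes per GB
        return params_billions * bits_per_weight / 8

    for bits in (16, 8, 4):
        total = footprint_gb(80, bits)   # weights that have to live somewhere
        active = footprint_gb(3, bits)   # weights actually multiplied per token
        print(f"{bits}-bit: ~{total:.0f} GB total, ~{active:.1f} GB active/token")

So at 4-bit you're still looking at ~40 GB of weights total even though only ~1.5 GB does work on any given token.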
coolspot 4 days ago
Yes, but you don't know which 3B parameters you will need, so you have to keep all 80B in your VRAM, or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it can be a different 3B for each next token.
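To make that concrete, here's a toy top-k router; the expert count, k, and dimensions are made up, but it shows why the hot set moves from token to token:

    import numpy as np

    # Toy MoE router: score each token's hidden state against every expert
    # and run only the top-k. Which k win depends on the token itself.
    rng = np.random.default_rng(0)
    n_experts, k, d = 64, 4, 16
    router_w = rng.standard_normal((d, n_experts))

    for t, hidden in enumerate(rng.standard_normal((3, d))):  # three tokens
        scores = hidden @ router_w
        chosen = np.argsort(scores)[-k:]  # top-k experts for THIS token
        print(f"token {t}: experts {sorted(chosen.tolist())}")
    # Different tokens pick different experts, so all expert weights must be
    # resident (or demand-paged) even though only k of them run per token.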
jwr 4 days ago
I understand that, but whether it's usable depends on whether Ollama can load parts of it into memory on my Mac, and how quickly.
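As far as I know, llama.cpp (which Ollama builds on) mmaps the weight file by default, so the OS only pages in what actually gets read. A minimal illustration of that mechanism, with a hypothetical file path:

    import mmap

    # Mapping a file reserves address space, not RAM; pages only become
    # resident when touched, which is why a model bigger than memory can
    # still start up and run its hot subset.
    with open("model.gguf", "rb") as f:  # hypothetical path
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        _ = mm[0]             # faults in just the first page
        _ = mm[len(mm) // 2]  # and one page from the middle
        mm.close()

How quickly that paging happens under real per-token expert churn is exactly the open question.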