bigyabai 4 days ago
The model is 80B parameters, but only 3B are activated during inference. I'm running the old 2507 Qwen3 30B model on my 8 GB Nvidia card and get very usable performance.
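Rough numbers, where the quantization levels are my own assumption rather than anything specific to this model:

    # Back-of-envelope footprint for an 80B-total / 3B-active MoE model.
    def footprint_gb(params_billions: float, bits_per_weight: int) -> float:
        # 1e9 params * (bits/8) bytes per param / 1e9 bytes per GB
        return params_billions * bits_per_weight / 8

    for bits in (16, 8, 4):
        total = footprint_gb(80, bits)   # weights that have to live somewhere
        active = footprint_gb(3, bits)   # weights actually multiplied per token
        print(f"{bits}-bit: ~{total:.0f} GB total, ~{active:.1f} GB active/token")

So at 4-bit you're still looking at ~40 GB of weights total even though only ~1.5 GB does work on any given token.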
coolspot 4 days ago
Yes, but you don't know which 3B parameters you will need, so you have to keep all 80B in your VRAM, or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it can be a different 3B for each next token.
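To make that concrete, here's a toy top-k router; the expert count, k, and dimensions are made up, but it shows why the hot set moves from token to token:

    import numpy as np

    # Toy MoE router: score each token's hidden state against every expert
    # and run only the top-k. Which k win depends on the token itself.
    rng = np.random.default_rng(0)
    n_experts, k, d = 64, 4, 16
    router_w = rng.standard_normal((d, n_experts))

    for t, hidden in enumerate(rng.standard_normal((3, d))):  # three tokens
        scores = hidden @ router_w
        chosen = np.argsort(scores)[-k:]  # top-k experts for THIS token
        print(f"token {t}: experts {sorted(chosen.tolist())}")
    # Different tokens pick different experts, so all expert weights must be
    # resident (or demand-paged) even though only k of them run per token.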
jwr 4 days ago
I understand that, but whether it's usable depends on whether Ollama can load parts of it into memory on my Mac, and how quickly.
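As far as I know, llama.cpp (which Ollama builds on) mmaps the weight file by default, so the OS only pages in what actually gets read. A minimal illustration of that mechanism, with a hypothetical file path:

    import mmap

    # Mapping a file reserves address space, not RAM; pages only become
    # resident when touched, which is why a model bigger than memory can
    # still start up and run its hot subset.
    with open("model.gguf", "rb") as f:  # hypothetical path
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        _ = mm[0]             # faults in just the first page
        _ = mm[len(mm) // 2]  # and one page from the middle
        mm.close()

How quickly that paging happens under real per-token expert churn is exactly the open question.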