Additional VRAM is needed for context.

This is a MoE model with only 3B active parameters per token, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than the full weights would need. The more you offload to the CPU, the slower it gets, though.
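To put rough numbers on that, here is a back-of-the-envelope sketch. Only the 3B active figure comes from the comment above; the 30B total and the 4.5 bits per weight are hypothetical values picked for illustration, and context (KV cache) memory is not counted.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of a set of weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

TOTAL_B = 30.0   # hypothetical total parameter count (all experts)
ACTIVE_B = 3.0   # parameters used per token, from the comment above
BITS = 4.5       # hypothetical ~4-bit quantization, including overhead

stored = weight_gib(TOTAL_B, BITS)    # must live somewhere: VRAM + system RAM
touched = weight_gib(ACTIVE_B, BITS)  # read per generated token

print(f"weights stored:    {stored:.1f} GiB")
print(f"weights per token: {touched:.1f} GiB")
```

The point is that all the weights have to be stored somewhere, but only about a tenth of them are read for each token, so the part that sits in system RAM hurts less than it would with a dense model of the same size.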

Glemllksdf 9 hours ago

Isn't it kind of a gamble if you offload random experts onto the CPU?

Or is it done only by layer, which would affect all experts?

	dragonwriter 8 hours ago
		Pretty sure all partial offload systems I’ve seen work by layers, but there might be something else out there.
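
A minimal sketch of what a by-layer split looks like, assuming every layer is the same size and a fixed chunk of VRAM is reserved for context. The function name and all the numbers are hypothetical, not taken from any particular runtime.

```python
def split_layers(n_layers: int, layer_gib: float,
                 vram_gib: float, reserve_gib: float) -> tuple[int, int]:
    """Return (layers on GPU, layers on CPU) for a simple by-layer offload."""
    # VRAM left for weights after setting aside room for context.
    budget = max(vram_gib - reserve_gib, 0.0)
    # Whole layers only: a layer is either entirely on the GPU or not.
    on_gpu = min(n_layers, int(budget // layer_gib))
    return on_gpu, n_layers - on_gpu

# Hypothetical: 48 layers of 0.5 GiB each, 12 GiB card, 2 GiB kept for context.
gpu_layers, cpu_layers = split_layers(48, 0.5, 12.0, 2.0)
print(gpu_layers, cpu_layers)
```

With a split like this, each offloaded layer takes all of its experts with it, so no individual expert is singled out at random.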