Aurornis 9 hours ago
Additional VRAM is needed for context. This is a MoE model with only about 3B parameters active per token (not per expert), which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than the full weights would need. The more you offload to the CPU, the slower it becomes, though.
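To make the trade-off concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `offload_plan`, the flat "reserve some VRAM for context" model, and the example numbers (30B total parameters, 4-bit weights, 12 GB of VRAM, 2 GB set aside for context). Real runtimes have extra overhead and split by layers or tensors, so treat the result as an estimate only.

```python
def offload_plan(total_params_b, bits_per_weight, vram_gb, context_gb):
    """Estimate how many GB of weights stay on the GPU vs. spill to the CPU.

    total_params_b  -- total (not active) parameters, in billions
    bits_per_weight -- quantization width, e.g. 4 for a 4-bit quant
    vram_gb         -- VRAM available on the card, in GB
    context_gb      -- VRAM reserved for context (assumed flat number)
    """
    # 1B parameters at 8 bits per weight is roughly 1 GB.
    weights_gb = total_params_b * bits_per_weight / 8
    # Whatever VRAM is left after the context reservation can hold weights.
    gpu_budget = max(vram_gb - context_gb, 0.0)
    gpu_gb = min(weights_gb, gpu_budget)
    # The remainder has to live in system RAM and run on the CPU.
    cpu_gb = weights_gb - gpu_gb
    return weights_gb, gpu_gb, cpu_gb


if __name__ == "__main__":
    weights, on_gpu, on_cpu = offload_plan(30, 4, 12, 2)
    print(f"weights: {weights} GB, GPU: {on_gpu} GB, CPU: {on_cpu} GB")
```

With these made-up inputs, about a third of the weights end up on the CPU. The larger that share, the slower generation gets.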
Glemllksdf 9 hours ago
Isn't it a bit of a gamble if you offload random experts onto the CPU? Or do you only offload whole layers, which would then affect all experts?