timschmidt 7 hours ago:
Even with MoE, holding the model in RAM while individual experts are evaluated in VRAM is a bit of a compromise. Experts can be swapped in and out of VRAM for each token, so RAM <-> VRAM bandwidth becomes important. With a model larger than RAM, that bottleneck gets pushed down to the SSD interface. At least the traffic is read-only rather than read-write, but even the fastest SSDs are significantly slower than RAM. That said, there are folks out there doing it; https://github.com/lyogavin/airllm is one example.
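To make the cost concrete, here's a minimal sketch of the per-token swap pattern, assuming a toy PyTorch MoE layer. The Expert class and swap_in helper are made up for illustration and are not airllm's API; the point is just that every routing decision pays a RAM -> VRAM copy of an expert's weights.

    # Toy sketch: experts live in (pinned) CPU RAM, one is copied to VRAM per token.
    # Expert and swap_in are hypothetical names, not airllm's API.
    import torch
    import torch.nn as nn

    class Expert(nn.Module):
        def __init__(self, d_model=1024, d_ff=4096):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))
        def forward(self, x):
            return self.ff(x)

    # All experts stay in CPU RAM; pin them so RAM -> VRAM copies can be async.
    cpu_experts = [Expert() for _ in range(8)]
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cuda":
        for e in cpu_experts:
            for p in e.parameters():
                p.data = p.data.pin_memory()

    def swap_in(expert):
        # Copy one expert's weights into a GPU-resident module.
        gpu_expert = Expert().to(device)
        gpu_state = {k: v.to(device, non_blocking=True)
                     for k, v in expert.state_dict().items()}
        gpu_expert.load_state_dict(gpu_state)
        return gpu_expert

    # Per-token loop: each routing choice triggers a full expert transfer.
    x = torch.randn(1, 1024, device=device)
    for idx in [3, 1, 3, 7]:          # pretend router decisions for 4 tokens
        gpu_expert = swap_in(cpu_experts[idx])
        x = gpu_expert(x)             # transfer cost dominates for large experts

If the expert doesn't fit in RAM either, that same copy comes off the SSD instead, which is where the bandwidth hierarchy really starts to hurt.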
nick49488171 34 minutes ago:
With a non-sequential generative approach, perhaps the RAM cache misses could be grouped together and the experts swapped in on a prioritized, when-available/when-needed basis.
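A hedged sketch of that grouping idea (purely illustrative, not anyone's shipping scheduler): bucket pending tokens by the expert they need and process the largest buckets first, so each expert crosses the RAM -> VRAM (or SSD -> RAM) boundary once per pass instead of once per token.

    # Group pending tokens by routed expert; biggest buckets first so one
    # expensive swap is amortized over the most tokens ("prioritized" ordering).
    from collections import defaultdict

    def group_by_expert(routing):
        # routing: list of (token_index, expert_id) pairs from the router
        buckets = defaultdict(list)
        for token_idx, expert_id in routing:
            buckets[expert_id].append(token_idx)
        return sorted(buckets.items(), key=lambda kv: len(kv[1]), reverse=True)

    routing = [(0, 3), (1, 1), (2, 3), (3, 7), (4, 3), (5, 1)]
    for expert_id, token_ids in group_by_expert(routing):
        # swap_in(expert_id)            # one transfer per expert...
        # run the expert over token_ids # ...amortized over the whole bucket
        print(expert_id, token_ids)

The catch is that ordinary autoregressive decoding only exposes one token's routing at a time, so this only pays off with batching, speculative/diffusion-style decoding, or some other way of knowing several routing decisions in advance.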