timschmidt 7 hours ago:
Even with MoE, holding the model in RAM while individual experts are evaluated in VRAM is a bit of a compromise. Experts can be swapped in and out of VRAM for each token, so RAM <-> VRAM bandwidth becomes important. With a model larger than RAM, that bottleneck gets pushed down to the SSD interface. At least the traffic is read-only rather than read-write, but even the fastest SSDs are significantly slower than RAM. That said, there are folks out there doing it; https://github.com/lyogavin/airllm is one example.
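To make the cost concrete, here's a minimal sketch of the per-token swap pattern, assuming a toy PyTorch MoE layer. The Expert class and swap_in helper are made up for illustration and are not airllm's API; the point is just that every routing decision pays a RAM -> VRAM copy of an expert's weights.

    # Toy sketch: experts live in (pinned) CPU RAM, one is copied to VRAM per token.
    # Expert and swap_in are hypothetical names, not airllm's API.
    import torch
    import torch.nn as nn

    class Expert(nn.Module):
        def __init__(self, d_model=1024, d_ff=4096):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))
        def forward(self, x):
            return self.ff(x)

    # All experts stay in CPU RAM; pin them so RAM -> VRAM copies can be async.
    cpu_experts = [Expert() for _ in range(8)]
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cuda":
        for e in cpu_experts:
            for p in e.parameters():
                p.data = p.data.pin_memory()

    def swap_in(expert):
        # Copy one expert's weights into a GPU-resident module.
        gpu_expert = Expert().to(device)
        gpu_state = {k: v.to(device, non_blocking=True)
                     for k, v in expert.state_dict().items()}
        gpu_expert.load_state_dict(gpu_state)
        return gpu_expert

    # Per-token loop: each routing choice triggers a full expert transfer.
    x = torch.randn(1, 1024, device=device)
    for idx in [3, 1, 3, 7]:          # pretend router decisions for 4 tokens
        gpu_expert = swap_in(cpu_experts[idx])
        x = gpu_expert(x)             # transfer cost dominates for large experts

If the expert doesn't fit in RAM either, that same copy comes off the SSD instead, which is where the bandwidth hierarchy really starts to hurt.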
nick49488171 34 minutes ago:
With a non-sequential generative approach, perhaps the RAM cache misses could be grouped together and the experts swapped in on a prioritized, when-available/when-needed basis.
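A hedged sketch of that grouping idea (purely illustrative, not anyone's shipping scheduler): bucket pending tokens by the expert they need and process the largest buckets first, so each expert crosses the RAM -> VRAM (or SSD -> RAM) boundary once per pass instead of once per token.

    # Group pending tokens by routed expert; biggest buckets first so one
    # expensive swap is amortized over the most tokens ("prioritized" ordering).
    from collections import defaultdict

    def group_by_expert(routing):
        # routing: list of (token_index, expert_id) pairs from the router
        buckets = defaultdict(list)
        for token_idx, expert_id in routing:
            buckets[expert_id].append(token_idx)
        return sorted(buckets.items(), key=lambda kv: len(kv[1]), reverse=True)

    routing = [(0, 3), (1, 1), (2, 3), (3, 7), (4, 3), (5, 1)]
    for expert_id, token_ids in group_by_expert(routing):
        # swap_in(expert_id)            # one transfer per expert...
        # run the expert over token_ids # ...amortized over the whole bucket
        print(expert_id, token_ids)

The catch is that ordinary autoregressive decoding only exposes one token's routing at a time, so this only pays off with batching, speculative/diffusion-style decoding, or some other way of knowing several routing decisions in advance.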