coolspot 4 days ago

Yes, but you don’t know which 3B parameters you will need, so you have to keep all 80B in VRAM or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it could be a different 3B for each subsequent token.

drozycki 4 days ago

The latest SSDs benchmark at 3 GB/s and up. The marginal latency would be trivial compared to the inference time.
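
For anyone who wants to put numbers on the transfer time being discussed, here is a rough back-of-envelope sketch. The 3B active parameters and the 3 GB/s figure come from the comments above; the bytes-per-parameter values (fp16, int8, 4-bit) are my assumptions, as is the worst case where none of the needed weights are already cached in RAM or VRAM. It also ignores the RAM -> VRAM hop and any random-read penalty, so treat it as an estimate only.

```python
def load_seconds(active_params: float, bytes_per_param: float,
                 bandwidth_bytes_per_s: float) -> float:
    """Time to stream the active parameters once at a sustained bandwidth."""
    return active_params * bytes_per_param / bandwidth_bytes_per_s


ACTIVE_PARAMS = 3e9   # from the thread: about 3B parameters active per token
SSD_BANDWIDTH = 3e9   # from the thread: 3 GB/s sequential read

# Assumed weight precisions (not stated in the thread).
PRECISIONS = [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]

for label, bytes_per_param in PRECISIONS:
    t = load_seconds(ACTIVE_PARAMS, bytes_per_param, SSD_BANDWIDTH)
    # Worst case: every active weight has to be re-read from the SSD.
    print(f"{label}: {t:.2f} s per token")
```

Under those assumptions this prints 2.00 s, 1.00 s, and 0.50 s per token respectively. How much caching lowers this depends on how often consecutive tokens reuse the same parameters.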