coolspot 4 days ago
Yes, but you don’t know which 3B parameters you will need, so you either have to keep all 80B in VRAM or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it could be a different 3B for each token.
drozycki 4 days ago
The latest SSDs benchmark at 3 GB/s and up. The marginal latency would be trivial compared to the inference time.
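
For anyone who wants to plug in their own numbers, here is a rough back-of-envelope sketch of the figures in this exchange. It assumes the worst case, where none of the 3B active parameters are already cached and all of them have to be streamed from the SSD for every token. The bytes-per-parameter values are assumptions (the thread doesn't state a quantization), and `worst_case_load_seconds` is just a hypothetical helper name.

```python
def worst_case_load_seconds(active_params, bytes_per_param, bandwidth_bytes_per_s):
    """Time to stream every active weight from storage, with no caching at all."""
    return active_params * bytes_per_param / bandwidth_bytes_per_s


ACTIVE_PARAMS = 3e9   # "3B" active parameters per token, from the thread
SSD_BANDWIDTH = 3e9   # "3 GB/s", from the thread, in bytes per second

# Assumed weight formats; the thread does not say which one applies.
for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    seconds = worst_case_load_seconds(ACTIVE_PARAMS, bytes_per_param, SSD_BANDWIDTH)
    print(f"{label}: {seconds:.2f} s per token if every active weight is reloaded")
```

This is only an upper bound on the storage cost: how much actually has to be reloaded per token depends on how often consecutive tokens reuse the same parameters, which the thread doesn't quantify.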