dist-epoch a day ago
After you load the weights into the GPU and keep the KV cache there too, you don't need any other significant traffic.
numpad0 a day ago
Even in tensor-parallel modes? I thought it could only work if you're fine with stalling all but n GPUs for n users at any given moment.
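
For a rough sense of scale on the tensor-parallel question, here is a back-of-envelope sketch, not a measurement. It assumes a Megatron-style split with two all-reduces per transformer layer in the forward pass, a ring all-reduce in which each GPU sends about 2(n-1)/n times the payload, and fp16 activations. The function name and the example model dimensions are hypothetical and not taken from the thread.

```python
def allreduce_bytes_per_token(hidden_size, num_layers, tp_degree,
                              bytes_per_elem=2, allreduces_per_layer=2):
    """Approximate bytes each GPU sends per decoded token, per sequence.

    Assumptions (illustrative only):
      - Megatron-style tensor parallelism: 2 all-reduces per layer (forward).
      - Ring all-reduce: each GPU sends 2 * (n - 1) / n times the payload.
      - The payload is one hidden-state vector per token.
    """
    if tp_degree <= 1:
        # A single GPU needs no inter-GPU activation traffic.
        return 0.0
    payload = hidden_size * bytes_per_elem
    ring_factor = 2 * (tp_degree - 1) / tp_degree
    return num_layers * allreduces_per_layer * payload * ring_factor


# Hypothetical model shape: hidden size 8192, 80 layers, 4-way tensor parallel.
per_token = allreduce_bytes_per_token(hidden_size=8192, num_layers=80, tp_degree=4)
print(f"{per_token / 2**20:.2f} MiB sent per GPU per token")
```

Under these assumptions the inter-GPU traffic grows with the number of tokens in flight, not with the size of the weights, so whether it counts as "significant" depends on the interconnect and the batch size.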