cornholio 7 days ago
Inference is essentially a large matrix computation run repeatedly on its own output: on each step the input matrix (the context window) is fed through the model and the newly generated token is appended to the end. That makes it easy to multiplex all active sessions over limited hardware; a typical server can hold hundreds of thousands of active contexts in main system RAM, each under 500KB, and ferry them to the GPU nearly instantaneously as required.
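As a rough sketch of that multiplexing idea (hypothetical PyTorch-style code; `model` is assumed to be a callable mapping a (1, seq_len) batch of token ids to logits, and names like `decode_step` are made up for illustration):

    import torch

    # Hypothetical scheduler state: each session's context lives in
    # host RAM as a 1-D tensor of token ids.
    sessions = {}  # session_id -> torch.LongTensor on the CPU

    def decode_step(model, session_id):
        # Ferry this session's context to the GPU for one forward pass.
        ctx = sessions[session_id].unsqueeze(0).to("cuda")
        with torch.no_grad():
            logits = model(ctx)              # (1, seq_len, vocab_size)
        next_token = logits[0, -1].argmax()  # greedy pick, for brevity
        # Append the new token and park the context back in system RAM.
        sessions[session_id] = torch.cat(
            [sessions[session_id], next_token.unsqueeze(0).cpu()]
        )
        return next_token.item()

Real servers batch many sessions into one forward pass rather than looping one at a time; the loop above is only meant to show that per-session state can be an ordinary tensor that sits in system RAM between steps.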
apitman 7 days ago
I was under the impression that context takes up a lot more VRAM than this.