cornholio 7 days ago
Inference is essentially a large matrix computation run repeatedly on its own output: on each step the input matrix (the context window) is fed through the model and the newly generated token is appended to the end. That makes it easy to multiplex all active sessions over limited hardware; a typical server can hold hundreds of thousands of active contexts in main system RAM, each under 500KB, and ferry them to the GPU nearly instantaneously as required.
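As a rough sketch of that multiplexing idea (hypothetical PyTorch-style code; `model` is assumed to be a callable mapping a (1, seq_len) batch of token ids to logits, and names like `decode_step` are made up for illustration):

    import torch

    # Hypothetical scheduler state: each session's context lives in
    # host RAM as a 1-D tensor of token ids.
    sessions = {}  # session_id -> torch.LongTensor on the CPU

    def decode_step(model, session_id):
        # Ferry this session's context to the GPU for one forward pass.
        ctx = sessions[session_id].unsqueeze(0).to("cuda")
        with torch.no_grad():
            logits = model(ctx)              # (1, seq_len, vocab_size)
        next_token = logits[0, -1].argmax()  # greedy pick, for brevity
        # Append the new token and park the context back in system RAM.
        sessions[session_id] = torch.cat(
            [sessions[session_id], next_token.unsqueeze(0).cpu()]
        )
        return next_token.item()

Real servers batch many sessions into one forward pass rather than looping one at a time; the loop above is only meant to show that per-session state can be an ordinary tensor that sits in system RAM between steps.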
apitman 7 days ago
I was under the impression that context takes up a lot more VRAM than this.