jrandolf 10 hours ago

vLLM handles GPU scheduling, not sllm. The model weights stay resident in VRAM permanently so there's no loading/unloading per request. vLLM uses continuous batching, so incoming requests are dynamically added to the running batch every decode step and the GPU is always working on multiple requests simultaneously. There is no "load to VRAM and run" per request; it's more like joining an already-running batch.

TTFT (time to first token) is under 2 seconds on average; worst case is 10-30s.
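The continuous-batching loop described above can be sketched as a toy simulation. This is not vLLM's actual scheduler (which also manages paged KV-cache memory, preemption, etc.); the function name and parameters are illustrative only. The key property it demonstrates: waiting requests join the running batch at decode-step granularity, and finished requests free their slot immediately.

```python
from collections import deque

def continuous_batching(arrivals, max_batch):
    """Toy continuous-batching simulation (illustrative, not vLLM's scheduler).

    arrivals: list of (request_id, tokens_to_generate) already queued.
    Returns the batch membership at each decode step.
    """
    waiting = deque(arrivals)   # requests not yet admitted
    running = {}                # request_id -> tokens still to generate
    timeline = []               # who was in the batch at each decode step
    while waiting or running:
        # Admission happens every step: new requests join as soon as a
        # slot frees up -- no waiting for the whole batch to finish.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        timeline.append(sorted(running))
        # One decode step: every running request generates one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # finished requests leave immediately
    return timeline

# With max_batch=2, request "c" slips into the batch the moment "a"
# finishes, while "b" keeps decoding uninterrupted:
print(continuous_batching([("a", 2), ("b", 3), ("c", 1)], max_batch=2))
# → [['a', 'b'], ['a', 'b'], ['b', 'c']]
```

Contrast with static batching, where "c" would have had to wait for both "a" and "b" to finish before a new batch could start.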

kaoD 8 hours ago | parent [-]

> The model weights stay resident in VRAM permanently so there's no loading/unloading per request.

Yes, I was thinking about context buffers, which I assume are not small in large models. Those have to be loaded into VRAM, right?

If I keep sending large context buffers, will that hog the batches?
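The intuition here is right: the per-request "context buffer" is the KV cache, and it is sizable. A back-of-envelope sketch, where every model number (layer count, KV heads, head dim) is an assumption for a hypothetical 70B-class transformer with grouped-query attention, not the model actually being served:

```python
# Rough per-request KV-cache sizing. All config numbers are assumed
# for a hypothetical 70B-class model with grouped-query attention.
n_layers = 80        # assumed
n_kv_heads = 8       # assumed (GQA, fewer KV heads than query heads)
head_dim = 128       # assumed
bytes_per_elem = 2   # fp16/bf16

# Each token stores a key AND a value vector per layer per KV head.
per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(per_token_bytes)   # 327680 bytes, i.e. 320 KiB per token

# One request with a full 32k-token context:
context_len = 32_768
total_gib = per_token_bytes * context_len / 2**30
print(total_gib)         # 10.0 GiB for a single request
```

So a handful of long-context requests can occupy tens of GiB of VRAM on top of the resident weights, which is why vLLM pages the KV cache in fixed-size blocks and why schedulers cap concurrent context.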

jrandolf 7 hours ago | parent | prev [-]

Not if you're the only one. We have rate limits to prevent that in case you, idk, share your key with 1000 people lol.