| ▲ | kiratp 2 hours ago | |
By caching they mean “cached in GPU memory”. That’s a very very scarce resource. Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere. Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example) | ||