zozbot234 17 hours ago:
> And there's no room for kv, so you'll OOM around 4k of context.

Can't you offload KV to system RAM, or even storage? It would make it possible to run with longer contexts, even with some overhead. AIUI, local AI frameworks include support for caching some of the KV in VRAM, using an LRU policy, so the overhead would be tolerable.
tcdent 17 hours ago | parent:
Not worth it; the performance hit is very significant. That said, people are trying to extend VRAM into system RAM or even NVMe storage, but as soon as high-bandwidth data like the KV cache has to cross the PCIe bus, you lose much of the performance benefit of having fast memory near the GPU die.
bastawhiz 13 hours ago | parent:
The performance already isn't spectacular with everything running in VRAM. It'll obviously depend on the model: an MoE will probably perform better than a dense model, and anything with reasoning is going to take _forever_ to even begin its actual output.
ranger_danger 17 hours ago | parent:
I know llama.cpp can; it certainly improved performance on my VRAM-starved GPU.
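For reference, llama.cpp exposes this through the `--no-kv-offload` flag (also `-nkvo`), which keeps the KV cache in host RAM while the model weights stay on the GPU. A minimal sketch, assuming a recent build where the binary is named `llama-cli`; the model path and context size here are placeholders:

```shell
# Sketch: run llama.cpp with the KV cache kept in system RAM instead of VRAM.
# ./models/model.gguf is a placeholder path.
./llama-cli -m ./models/model.gguf \
  -ngl 99 \
  --no-kv-offload \
  -c 16384 \
  -p "Hello"
# -ngl 99          offload all model layers to the GPU
# --no-kv-offload  keep the KV cache in host RAM (trades speed for capacity)
# -c 16384         a context length that might not fit in VRAM alone
```

The trade-off is exactly what tcdent describes upthread: every token's attention reads now cross the PCIe bus, so you gain context capacity at the cost of generation speed.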