| ▲ | superkuh 9 days ago | |||||||
> consumer-grade card with 12G of VRAM and got 5t/s That token output speed suggests to me that it is running in hybrid mode and involving the CPU and system RAM. ~5 tk/s is about what DDR4 memory bandwidth allows for a model of that size at 4-bit. Any consumer GPU with 12 GB, like an Nvidia RTX 2080 or RTX 3060, should be doing 20+ tk/s with llama.cpp and the CUDA backend. | ||||||||
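The bandwidth argument above can be sketched as a back-of-envelope calculation. It rests on the common rule of thumb that generating one token streams every weight once, so decode speed is capped at roughly memory bandwidth divided by model size. Nothing here was measured in the thread: the 10 GB model size is a purely hypothetical stand-in, since the thread does not give the actual size. The bandwidth figures are theoretical peaks for dual-channel DDR4-3200 and for an RTX 3060 12 GB, and the function name is invented for illustration.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed, assuming each generated token
    requires reading all model weights from memory exactly once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumption: dual-channel DDR4-3200, theoretical peak.
# 2 channels * 8 bytes per transfer * 3200 MT/s = 51.2 GB/s
DDR4_3200_DUAL_CHANNEL_GB_S = 2 * 8 * 3200 / 1000

# Assumption: spec-sheet peak memory bandwidth of an RTX 3060 12 GB.
RTX_3060_12GB_GB_S = 360.0

# Hypothetical size of the quantized weights; not taken from the thread.
MODEL_SIZE_GB = 10.0

for name, bandwidth in [
    ("system RAM (DDR4-3200, dual channel)", DDR4_3200_DUAL_CHANNEL_GB_S),
    ("GPU VRAM (RTX 3060 12 GB)", RTX_3060_12GB_GB_S),
]:
    bound = decode_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{bound:.1f} tok/s")
```

Real throughput lands below these bounds, and a run split between GPU and CPU is limited mostly by the slower memory, which is why a single-digit tk/s figure points toward system RAM being involved.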
| ▲ | senko 9 days ago | parent | next [-] | |||||||
Good catch. I haven't looked deeply into it. This is with the Vulkan backend on Linux, which I understand should be roughly comparable to CUDA? The GPU is an RTX 3060 (Ti?). I should play a bit more with llama.cpp options and see what happened there. Thanks! | ||||||||
| ▲ | pja 8 days ago | parent | prev [-] | |||||||
The 8-bit quant runs at 36 tps using Vulkan on my AMD RX 9070. | ||||||||