zaat 2 days ago
Thank you for your work. You have an answer on your page regarding "Should I pick 26B-A4B or 31B?", but can you please clarify: assuming 24GB VRAM, should I pick the full-precision smaller model or the 4-bit larger model?
petu 2 days ago | parent | next
Try 26B first. 31B seems to have a very heavy KV cache (maybe bugged in llama.cpp at the moment; 16K takes up 4.9GB).

edit: the 31B cache is not bugged; there's a static SWA cost of 3.6GB. So IQ4_XS at 15.2GB seems like a reasonable pairing, but even then it's barely enough for 64K context on 24GB VRAM. Maybe 8-bit KV quantization is fine now after https://github.com/ggml-org/llama.cpp/pull/21038 got merged, so 100K+ is possible.

> I should pick a full precision smaller model or 4 bit larger model?

The 4-bit larger model. You have to use a quant either way -- even if by "full precision" you mean 8-bit, that's 26GB + overhead + chat context. Try UD-Q4_K_XL.
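The arithmetic above can be sketched as a quick back-of-the-envelope budget. All figures are taken from this comment (15.2GB IQ4_XS weights, ~3.6GB static SWA cost, 4.9GB total cache at 16K, implying ~1.3GB of growing KV per 16K tokens); the linear KV scaling and the halving under 8-bit KV quantization are rough assumptions, not measurements, and runtime overhead is ignored:

```python
# Rough VRAM budget for the 31B model at IQ4_XS, using figures quoted
# in the thread. Approximations only -- not measured values.

WEIGHTS_GB = 15.2      # IQ4_XS quant of the 31B model
STATIC_SWA_GB = 3.6    # fixed sliding-window attention cache cost
KV_GB_PER_16K = 1.3    # growing KV cache per 16K tokens at fp16 KV

def vram_estimate_gb(context_tokens: int, kv_bits: int = 16) -> float:
    """Estimate total VRAM (GB) for a context length and KV precision."""
    kv = KV_GB_PER_16K * (context_tokens / 16_384) * (kv_bits / 16)
    return WEIGHTS_GB + STATIC_SWA_GB + kv

if __name__ == "__main__":
    for ctx in (16_384, 65_536, 131_072):
        fp16 = vram_estimate_gb(ctx)
        q8 = vram_estimate_gb(ctx, kv_bits=8)
        print(f"{ctx // 1024:>4}K: {fp16:5.1f} GB (fp16 KV) | {q8:5.1f} GB (q8 KV)")
```

Under these assumptions, 64K at fp16 KV lands at ~24.0GB (exactly the budget, with nothing left for overhead), while q8 KV keeps even 128K at ~24GB -- consistent with "barely enough for 64K" and "100K+ is possible".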
danielhanchen 2 days ago | parent | prev
Thank you! I presume 26B-A4B is somewhat faster since only 4B parameters are activated per token -- 31B is quite a large dense model, so it's more accurate!