celrod 3 hours ago

What quant do you run it at? 32GB seems like cutting it close on the RTX 5090 if going 8-bit, but other commenters are saying 4-bit lobotomizes the model.

ggerganov 3 hours ago | parent [-]

As a baseline, I run all models in Q8 [0] because I want to be confident that when I observe a problem, the root cause is not the quantization. However, in this specific case, I use Q8 on the Mac and Q4 on the RTX machine because the latter cannot fit the full context at Q8. So far, I don't have conclusive evidence that the Q4 quantization affects the quality in a significant way for this model and the tasks that I am using it for.

[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...
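
To make the "does not fit the full context at Q8" trade-off concrete, here is a rough back-of-the-envelope sketch. The thread gives no model dimensions, so the parameter count, layer count, KV head count, head size and context length below are hypothetical placeholders. It also assumes the plain Q8_0 and Q4_0 block formats, an fp16 KV cache, and treats 32GB as 32 GiB, while ignoring compute buffers and other runtime overhead, so real usage will be higher.

```python
# Rough VRAM estimate: quantized weights + KV cache.
# All model dimensions used below are hypothetical, not taken from the thread.

GIB = 1024 ** 3

# Effective bits per weight for two llama.cpp block formats:
#   Q8_0: 32 int8 values + one fp16 scale  -> 34 bytes per 32 weights
#   Q4_0: 32 4-bit values + one fp16 scale -> 18 bytes per 32 weights
BITS_PER_WEIGHT = {"Q8_0": 34 * 8 / 32, "Q4_0": 18 * 8 / 32}


def weights_gib(n_params, quant):
    """Approximate size of the quantized weights in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / GIB


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size in GiB (K and V per layer, fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / GIB


def fits(n_params, quant, vram_gib, **kv_dims):
    """True if weights plus KV cache fit in the given VRAM budget."""
    return weights_gib(n_params, quant) + kv_cache_gib(**kv_dims) <= vram_gib


# Hypothetical 27B-parameter model with a 32k context.
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
for quant in ("Q8_0", "Q4_0"):
    total = weights_gib(27e9, quant) + kv_cache_gib(**dims)
    print(f"{quant}: {total:.1f} GiB, fits in 32 GiB: {fits(27e9, quant, 32, **dims)}")
```

With these made-up numbers, the Q8_0 weights plus cache land just over the 32 GiB budget while Q4_0 leaves plenty of headroom, which is the same shape of trade-off described above. Plug in the real dimensions of whatever model you run to get a meaningful estimate.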