zargon 4 days ago

The best VRAM calculator I have found is https://apxml.com/tools/vram-calculator. It is much more thorough than this one: for example, it understands different models' attention schemes, so it computes the KV cache size correctly, and it supports quantization of both the model and the KV cache, as well as fine-tuning. It has its own limitations, such as only supporting specific models. In practice, though, the generic calculators are not very useful, because model architectures vary (mainly in the KV cache) and the results end up way off. (Not sure whether it would be better to discuss it separately, but I submitted it at https://news.ycombinator.com/item?id=44677409)
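The KV cache term that generic calculators get wrong is straightforward once you know the model's attention layout; a minimal sketch (the layer/head figures in the example are my assumptions about a typical GQA model, not numbers from either calculator):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, each of shape
    [num_kv_heads, context_len, head_dim]. Defaults to fp16 elements."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Example: a GQA model with 64 layers, 8 KV heads, head_dim 128
# (assumed figures) at 32K context, fp16 cache:
gib = kv_cache_bytes(64, 8, 128, 32768) / 2**30
print(f"{gib:.1f} GiB")  # 8.0 GiB
```

This is why the architecture matters so much: the same model with full multi-head attention (64 KV heads instead of 8) would need 8x the cache, so a calculator that ignores GQA/MQA can be off by nearly an order of magnitude.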

oktoberpaard 4 days ago | parent | next [-]

It gives weird results for me. I’m using Qwen3-32B with 32K context length at Q4_K_M, with an 8-bit KV cache, fully offloaded to 24GB VRAM. According to this calculator that should be impossible by a large margin, yet it works for me.

Edit: this might be because I’ve got flash attention enabled in Ollama.
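That would explain it: in Ollama the quantized KV cache only takes effect when flash attention is enabled; otherwise the cache stays at fp16. A back-of-the-envelope check (the layer/head counts are my assumptions about Qwen3-32B's GQA layout, and the Q4_K_M weight size is approximate):

```python
# Rough VRAM budget check. Assumed numbers: ~20 GB Q4_K_M weights
# (~18.5 GiB), and 64 layers / 8 KV heads / head_dim 128 for Qwen3-32B.
weights_gib = 18.5

def kv_gib(bytes_per_elem):
    # 2 tensors (K and V) per layer, each [kv_heads, context, head_dim]
    return 2 * 64 * 8 * 128 * 32768 * bytes_per_elem / 2**30

print(f"fp16 KV cache:  {weights_gib + kv_gib(2):.1f} GiB total")  # 26.5 GiB: over 24 GB
print(f"8-bit KV cache: {weights_gib + kv_gib(1):.1f} GiB total")  # 22.5 GiB: fits
```

Under those assumptions, the 8-bit cache is exactly the difference between not fitting and fitting in 24GB, which matches what you're seeing.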

zeroq 4 days ago | parent | prev | next [-]

This one is indeed much better, and it instantly addresses the feedback I wanted to leave on the one originally posted: instead of calculating an artificial scenario, I would like to state what I can run on the hardware I actually have at hand. Thanks!

jwrallie 4 days ago | parent | prev | next [-]

Nice! I could have saved so much time downloading models for trial and error with this.

yepyip 4 days ago | parent | prev [-]

Somehow you have to log in now to use it. It wasn't like this a few weeks ago...

mdaniel 4 days ago | parent [-]

That is not my experience; maybe your IP is flagged as hammering their site?

yepyip 3 days ago | parent [-]

Oh, I wasn't aware of this. But how can you hammer a calculator? Yes, I have used it like 50 times, checking how big a Q4 or smaller model would be with different batch sizes and concurrent users. Do you think that is a heavy calculation?