lagrange77 | 9 days ago
While you're here: does anyone know of a website that clearly shows which open-source LLMs run on / fit into a specific GPU setup? The best heuristic I could find for the necessary VRAM is Number of Parameters × (Precision / 8) × 1.2, from here [0]. [0] https://medium.com/@lmpo/a-guide-to-estimating-vram-for-llms...
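The heuristic above can be sketched as a small function. This is a rough sketch, not an exact calculator; the 1.2 overhead factor comes from the linked guide, and real usage also depends on KV cache, context length, and runner:

```python
def estimate_vram_gb(params_billion: float, precision_bits: int = 16,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB for loading a model's weights.

    params_billion: parameter count in billions (e.g. 7 for a 7B model).
    precision_bits: bits per weight (16 for fp16, 8 for int8, 4 for 4-bit quants).
    overhead: multiplier for activations, KV cache, etc. (1.2 per the guide).
    """
    bytes_per_param = precision_bits / 8
    return params_billion * bytes_per_param * overhead

# A 7B model in fp16: 7 x 2 bytes x 1.2 = 16.8 GB
print(round(estimate_vram_gb(7, 16), 1))
```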
philipkiely | 8 days ago
Yeah, we have tried to build calculators before; it just depends on so much. Your equation is roughly correct, but I tend to multiply by a factor of 2, not 1.2, to allow for highly concurrent traffic.
reactordev | 9 days ago
Hugging Face has this built in if you care to fill out your software and hardware profile here: https://huggingface.co/settings/local-apps Then, on each model page, it will show you whether you can use it.
diggan | 9 days ago
Maybe I'm spoiled by having a great internet connection, but I usually download the weights and try to run them with various tools (typically llama.cpp, LM Studio, vLLM, and SGLang) and see what works. There seem to be so many variables involved (runners, architectures, implementations, hardware, and so on) that none of the calculators I've tried so far have been accurate; they've both over-estimated and under-estimated what I could run. So in the end, actually trying to run them seems to be the only foolproof way of knowing for sure :)
lagrange77 | 8 days ago
Thanks for your answers! While it is seemingly hard to calculate, maybe someone should just build a database website that tracks specific setups (model, exact variant/quantisation, runner, hardware), where users can report which combinations they got running (or not), along with metrics like tokens/s. Visitors could then specify their runner and hardware and filter for a list of models that would run on that setup.
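The proposed report record and filter could look something like this minimal sketch; all field names and the sample values (models, hardware, tokens/s figures) are hypothetical illustrations, not real benchmark data:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunReport:
    """One user-submitted report for the proposed compatibility database."""
    model: str                            # e.g. "Llama-3.1-8B"
    quantization: str                     # e.g. "Q4_K_M"
    runner: str                           # e.g. "llama.cpp"
    hardware: str                         # e.g. "RTX 4090 24GB"
    ran: bool                             # did it load and generate?
    tokens_per_sec: Optional[float] = None  # omitted for failed runs

def models_that_run(reports, runner, hardware):
    """Filter to models reported working on a given runner + hardware combo."""
    return sorted({r.model for r in reports
                   if r.runner == runner and r.hardware == hardware and r.ran})

# Illustrative reports (made-up numbers):
reports = [
    RunReport("Llama-3.1-8B", "Q4_K_M", "llama.cpp", "RTX 4090 24GB", True, 110.0),
    RunReport("Llama-3.1-8B", "F16", "llama.cpp", "RTX 3060 12GB", False),
]
print(models_that_run(reports, "llama.cpp", "RTX 4090 24GB"))
```

Failed runs are stored too, since "does not fit" is exactly the signal a visitor wants before downloading tens of gigabytes of weights.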