lagrange77 | 9 days ago
While you're here: does anyone know of a website that clearly shows which open-source LLMs run on / fit into a specific GPU setup? The best heuristic I could find for the necessary VRAM is Number of Parameters × (Precision / 8) × 1.2, from here [0]. [0] https://medium.com/@lmpo/a-guide-to-estimating-vram-for-llms...
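The heuristic above can be sketched as a small function. This is a rough sketch, not an exact calculator; the 1.2 overhead factor comes from the linked guide, and real usage also depends on KV cache, context length, and runner:

```python
def estimate_vram_gb(params_billion: float, precision_bits: int = 16,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB for loading a model's weights.

    params_billion: parameter count in billions (e.g. 7 for a 7B model).
    precision_bits: bits per weight (16 for fp16, 8 for int8, 4 for 4-bit quants).
    overhead: multiplier for activations, KV cache, etc. (1.2 per the guide).
    """
    bytes_per_param = precision_bits / 8
    return params_billion * bytes_per_param * overhead

# A 7B model in fp16: 7 x 2 bytes x 1.2 = 16.8 GB
print(round(estimate_vram_gb(7, 16), 1))
```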
philipkiely | 8 days ago
Yeah, we have tried to build calculators before; it just depends on so much. Your equation is roughly correct, but I tend to multiply by a factor of 2, not 1.2, to allow for highly concurrent traffic.
reactordev | 9 days ago
Hugging Face has this built in if you care to fill out your software and hardware profile here: https://huggingface.co/settings/local-apps Then, on each model page, it will show you whether you can use it.
diggan | 9 days ago
Maybe I'm spoiled by having a great internet connection, but I usually download the weights and try to run them with various tools (typically llama.cpp, LM Studio, vLLM, and SGLang) and see what works. There seem to be so many variables involved (runners, architectures, implementations, hardware, and so on) that none of the calculators I've tried so far have been accurate; they've both over-estimated and under-estimated what I could run. So in the end, actually trying to run them seems to be the only foolproof way of knowing for sure :)
lagrange77 | 8 days ago
Thanks for your answers! While it is seemingly hard to calculate, maybe someone should just build a database website that tracks specific setups (model, exact variant/quantisation, runner, hardware), where users can report which combinations they got running (or not), along with metrics like tokens/s. Visitors could then specify their runner and hardware and filter for a list of models that would run on that setup.
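The proposed report record and filter could look something like this minimal sketch; all field names and the sample values (models, hardware, tokens/s figures) are hypothetical illustrations, not real benchmark data:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunReport:
    """One user-submitted report for the proposed compatibility database."""
    model: str                            # e.g. "Llama-3.1-8B"
    quantization: str                     # e.g. "Q4_K_M"
    runner: str                           # e.g. "llama.cpp"
    hardware: str                         # e.g. "RTX 4090 24GB"
    ran: bool                             # did it load and generate?
    tokens_per_sec: Optional[float] = None  # omitted for failed runs

def models_that_run(reports, runner, hardware):
    """Filter to models reported working on a given runner + hardware combo."""
    return sorted({r.model for r in reports
                   if r.runner == runner and r.hardware == hardware and r.ran})

# Illustrative reports (made-up numbers):
reports = [
    RunReport("Llama-3.1-8B", "Q4_K_M", "llama.cpp", "RTX 4090 24GB", True, 110.0),
    RunReport("Llama-3.1-8B", "F16", "llama.cpp", "RTX 3060 12GB", False),
]
print(models_that_run(reports, "llama.cpp", "RTX 4090 24GB"))
```

Failed runs are stored too, since "does not fit" is exactly the signal a visitor wants before downloading tens of gigabytes of weights.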