bachmeier 2 days ago
So is there something I can take from that table if I have a 24 GB video card? I'm honestly not sure how to use those numbers.
GistNoesis 2 days ago | parent
I just tried with llama.cpp on an RTX 4090 (24 GB), using the unsloth GGUF UD_Q4_K_XL quants. You can probably run them all: G4 31B runs at ~5 tok/s, G4 26B A4B at ~150 tok/s, and Q3.5-35B-A3B at ~100 tok/s.

I tried G4 26B A4B as a drop-in replacement for Q3.5-35B-A3B in some custom agents, and G4 doesn't respect the prompt rules at all. (I added <|think|> to the system prompt as described, but haven't spent time checking whether the reasoning was actually on.) I'll need to investigate further, but it doesn't seem promising.

I also tried G4 26B A4B with images in the webui, and it works quite well. I have not yet tried the smaller models with audio.
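For anyone wanting to reproduce this setup, a rough sketch of the llama.cpp invocation (the model filename is a placeholder for whichever unsloth UD_Q4_K_XL GGUF you downloaded; adjust context size to taste):

```shell
# Serve a GGUF quant locally with llama.cpp's llama-server.
# -ngl 99 offloads all layers to the GPU (fits in 24 GB at Q4 for these sizes).
llama-server \
  -m ./model-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -c 8192 \
  --port 8080
```

The built-in webui (used above for the image test) is then reachable at http://localhost:8080.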