GistNoesis 2 days ago
I just tried with llama.cpp on an RTX 4090 (24 GB), using the unsloth GGUF quants (UD_Q4_K_XL). You can probably run them all. G4 31B runs at ~5 tok/s, G4 26B A4B runs at ~150 tok/s. You can run Q3.5-35B-A3B at ~100 tok/s.

I tried G4 26B A4B as a drop-in replacement for Q3.5-35B-A3B in some custom agents, and G4 doesn't respect the prompt rules at all. (I added <|think|> in the system prompt as described, but have not spent time checking whether reasoning was effectively on.) I'll need to investigate further, but it doesn't seem promising.

I also tried G4 26B A4B with images in the webui, and it works quite well. I have not yet tried the smaller models with audio.
kpw94 2 days ago
> I'll need to investigate further but it doesn't seem promising.

That's what I meant by "waiting a few days for updates" in my other comment. At the Qwen 3.5 release, I remember a lot of complaints along the lines of "tool calling isn't working properly", etc. That was fixed shortly after: there was some template-parsing work in llama.cpp, and unsloth pulled some models and re-uploaded better ones to fix something else I can't quite remember, better quantization or something.

coder543 pointed out the same is happening with tool calling on gemma4: https://news.ycombinator.com/item?id=47619261
amarshall 2 days ago
If you are running on a 4090 and getting 5 tok/s, then you have exceeded your VRAM and are offloading to the CPU (or there is some other serious performance issue).
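A rough back-of-envelope calculation supports this. Assuming an effective ~4.6 bits per weight for a Q4_K-style quant (an assumption; real GGUF file sizes vary with the tensor mix), a 31B dense model's weights alone approach 18 GB, and KV cache plus compute buffers at a long context can push the total past 24 GB, forcing layer offload to system RAM:

```python
def quant_size_gb(params_b: float, bits_per_weight: float = 4.6) -> float:
    """Approximate weight size in GB for a quantized model.

    bits_per_weight ~4.6 is an assumed effective rate for a Q4_K-style
    quant; actual GGUF sizes differ per model.
    """
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

weights = quant_size_gb(31)  # ~17.8 GB of weights for a 31B dense model
# Add several GB of KV cache and buffers at long context, and a 24 GB
# card spills layers to system RAM, collapsing tok/s.
print(round(weights, 1))
```

This is only a sizing sketch; checking llama.cpp's startup log for how many layers were actually offloaded is the reliable way to confirm.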
mrinterweb a day ago
Thank you. I have the same card, and I noticed the same ~100 tok/s when I ran Q3.5-35B-A3B. G4 26B A4B running at 150 tok/s is a 50% performance gain. That's pretty huge.