xienze | 5 hours ago
It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB of VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I could probably do 256K, since I still have about 4GB of VRAM free). Prompt processing runs at something like 1K tokens/second, and generation is around 100 tokens/second. Plenty fast for agentic use via Opencode.
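For a rough sense of where that 24GB goes, here's a back-of-envelope estimate in Python. Every config value below (layer count, KV heads, head dim, bits per weight) is an illustrative assumption, not this model's published spec; swap in the real numbers from your model's metadata.

    # Back-of-envelope VRAM estimate: quantized weights + KV cache.
    # All config values are assumed for illustration -- check the
    # model's actual metadata (e.g. the GGUF header) before trusting this.
    GIB = 1024**3

    total_params    = 35e9       # "35B" total parameters
    bits_per_param  = 4.5        # Q4_K-style quants average ~4.5 bits/weight
    n_layers        = 48         # assumed
    n_kv_heads      = 4          # GQA: only KV heads count toward the cache
    head_dim        = 128        # assumed
    kv_bytes_per_el = 2          # fp16 cache; a q8_0 quantized cache halves this
    context_len     = 128 * 1024

    weights_bytes = total_params * bits_per_param / 8
    # Per token the cache stores K and V for every layer:
    # 2 * layers * kv_heads * head_dim elements
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_el * context_len

    print(f"weights : {weights_bytes / GIB:5.1f} GiB")
    print(f"KV cache: {kv_bytes / GIB:5.1f} GiB")
    print(f"total   : {(weights_bytes + kv_bytes) / GIB:5.1f} GiB")

With a MoE model only the A3B active parameters touch compute per token, but all the weights still have to sit in VRAM, which is why the quant level and whether the KV cache is quantized are what decide if 128K (or 256K) context fits.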
rahimnathwani | 5 hours ago
There seem to be a lot of different Q4s of this model: https://www.reddit.com/r/LocalLLaMA/s/kHUnFWZXom
I'm curious which one you're using.
| |||||||||||||||||
msuniverse2026 | 5 hours ago
I've had an AMD card for the last 5 years, so I kinda tuned out of local LLM releases because AMD seemed to abandon ROCm support for my card (6900 XT). Is AMD capable of anything these days?
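For what it's worth, a quick way to check what the card can still do: a minimal sketch, assuming a ROCm build of PyTorch is installed (on those builds the familiar torch.cuda API is backed by HIP, so the same calls work).

    # Check whether a ROCm build of PyTorch can see the AMD GPU.
    # torch.version.hip is a version string on ROCm builds, None on CUDA builds.
    import torch

    print("ROCm/HIP build:", torch.version.hip is not None)
    print("GPU visible   :", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device        :", torch.cuda.get_device_name(0))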
| |||||||||||||||||