Havoc 6 days ago

A 30B-class model should run on a consumer 24 GB card when quantised, though you'd need a pretty aggressive quant to make room for context. Don't think you'll get the full 256k context though.

So about 700 bucks for a 3090 on eBay

magicalhippo 6 days ago

I have a 5070 Ti and a 2080 Ti, but I'm running Windows, so roughly 25-26 GB of VRAM is available. With Flash Attention enabled, I can just about squeeze in Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL from Unsloth with 64k context entirely on the GPUs.

With a 3090 I guess you'd have to reduce context or go for a slightly more aggressive quantization level.
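The squeeze makes sense on the back of an envelope. This is only a rough sketch: the bits-per-weight figure for Q4_K_XL and the architecture numbers (48 layers, 4 KV heads, head dim 128 for Qwen3-30B-A3B) are assumptions on my part, not figures from the thread:

```python
# Back-of-envelope VRAM estimate for a ~30B MoE model at a Q4-ish quant.
# Architecture numbers below are assumed, not authoritative.
params = 30.5e9
bits_per_weight = 4.5                      # rough average for a Q4_K-family quant
weights_gb = params * bits_per_weight / 8 / 1e9

layers, kv_heads, head_dim = 48, 4, 128    # assumed GQA layout
bytes_per_token = layers * 2 * kv_heads * head_dim * 2   # K+V, fp16
ctx = 64 * 1024
kv_gb = bytes_per_token * ctx / 1e9

print(f"weights ~{weights_gb:.1f} GB + 64k KV cache ~{kv_gb:.1f} GB "
      f"= ~{weights_gb + kv_gb:.1f} GB")
```

That lands in the low-to-mid 20s of GB before compute buffers and overhead, which is why ~25-26 GB across two cards just barely fits and a single 24 GB 3090 needs less context or a smaller quant.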

Summarizing llama-arch.cpp, which is roughly 40k tokens, I get ~50 tok/sec generation speed and ~14 seconds to first token.

For short prompts I get more like ~90 tok/sec and <1 sec to first token.