Havoc 6 days ago
A 30B-class model should run on a consumer 24 GB card when quantised, though you'd need a pretty aggressive quant to make room for context. Don't think you'll get the full 256k context, though. So about 700 bucks for a 3090 on eBay.
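Rough back-of-envelope for why the quant has to be aggressive (the 4.5 bits/weight figure is an assumed average for Q4_K-style quants, not a measurement):

    # VRAM budget for a ~30B model on a 24 GB card.
    params = 30e9                 # ~30B parameters
    bits_per_weight = 4.5         # assumed: Q4_K mixes 4-bit and higher-bit tensors
    weights_gb = params * bits_per_weight / 8 / 1e9
    print(f"weights: ~{weights_gb:.1f} GB")                      # ~16.9 GB
    print(f"left for KV cache etc.: ~{24 - weights_gb:.1f} GB")  # ~7.1 GB

So the weights alone eat most of the card, and everything else (KV cache, compute buffers) has to fit in what's left, which is what caps the usable context.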
magicalhippo 6 days ago
I have a 5070 Ti and a 2080 Ti, but I'm running Windows, so roughly 25-26 GB is available. With Flash Attention enabled, I can just about squeeze in Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL from Unsloth with 64k context entirely on the GPUs. With a 3090 I guess you'd have to reduce context or go for a slightly more aggressive quantization level.

Summarizing llama-arch.cpp, which is roughly 40k tokens, I get ~50 tok/sec generation speed and ~14 seconds to first token. For short prompts I get more like ~90 tok/sec and <1 sec to first token.
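A minimal sketch of why 64k context just about fits in that budget; the architecture numbers below are assumptions from memory of Qwen3-30B-A3B's config (verify against the model's config.json):

    # KV-cache size at 64k context for a GQA model.
    layers, kv_heads, head_dim = 48, 4, 128   # assumed Qwen3-30B-A3B values
    ctx = 64 * 1024
    bytes_per_elem = 2                        # f16 cache; llama.cpp can also quantize K/V
    kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx  # K and V
    print(f"KV cache: ~{kv_bytes / 2**30:.1f} GiB")  # ~6 GiB at f16

Under those assumptions, ~17 GB of Q4 weights plus ~6 GiB of f16 KV cache lands right around the 25-26 GB ceiling, which matches the "just about squeeze in" experience above.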