magicalhippo 6 days ago

I have a 5070 Ti and a 2080 Ti, but I'm running Windows, so only roughly 25-26 GB of VRAM is available. With Flash Attention enabled, I can just about squeeze in Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL from Unsloth with 64k context entirely on the GPUs.

With a 3090 I guess you'd have to reduce context or go for a slightly more aggressive quantization level.

Summarizing llama-arch.cpp, which is roughly 40k tokens, I get ~50 tok/sec generation speed and ~14 seconds to first token.

For short prompts I get more like ~90 tok/sec and <1 sec to first token.
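
To put the llama-arch.cpp numbers in perspective, here's a rough back-of-the-envelope sketch. It assumes time to first token is dominated by prompt processing, and the 500-token summary length is a made-up figure for illustration, not something I measured. The function names are just mine.

```python
def prefill_rate(prompt_tokens: int, seconds_to_first_token: float) -> float:
    """Approximate prompt-processing speed in tokens/sec.

    Assumes time to first token is almost entirely prompt processing.
    """
    return prompt_tokens / seconds_to_first_token


def total_seconds(seconds_to_first_token: float,
                  output_tokens: int,
                  gen_tokens_per_sec: float) -> float:
    """Rough wall-clock time: wait for the first token, then generate."""
    return seconds_to_first_token + output_tokens / gen_tokens_per_sec


# ~40k-token prompt, ~14 s to first token (figures from the comment above)
rate = prefill_rate(40_000, 14.0)
print(f"prompt processing: ~{rate:.0f} tok/sec")

# Hypothetical 500-token summary at ~50 tok/sec generation speed
total = total_seconds(14.0, 500, 50.0)
print(f"total for a 500-token summary: ~{total:.0f} s")
```

So the prompt is being processed at somewhere around 2.9k tok/sec, and for a long input like this the wait before the first token is a big chunk of the total time.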