magicalhippo | 6 days ago
I have a 5070 Ti and a 2080 Ti, but I'm running Windows, so only roughly 25-26 GB of VRAM is available. With Flash Attention enabled, I can just about squeeze Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL from Unsloth with 64k context entirely onto the GPUs. With a 3090 I guess you'd have to reduce the context or go for a slightly more aggressive quantization level.

Summarizing llama-arch.cpp, which is roughly 40k tokens, I get ~50 tok/sec generation speed and ~14 seconds to first token. For short prompts I get more like ~90 tok/sec and <1 sec to first token.
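For anyone wanting to try a similar setup, here's a minimal sketch using the llama-cpp-python bindings (I may well be running the llama.cpp CLI directly instead; the file name, tensor split ratio, and prompt are illustrative, and parameter support varies by version):

    from llama_cpp import Llama

    # Load the Unsloth Q4_K_XL GGUF with 64k context, all layers on GPU,
    # Flash Attention on, split across two cards (ratio is an assumption).
    llm = Llama(
        model_path="Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf",
        n_ctx=65536,
        n_gpu_layers=-1,          # offload every layer to the GPUs
        flash_attn=True,          # needed to fit 64k context in ~25 GB
        tensor_split=[0.6, 0.4],  # hypothetical 5070 Ti / 2080 Ti split
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize this file: ..."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])

The tensor split is the knob to fiddle with: the two cards have different VRAM sizes, so an even split will OOM the smaller one before the bigger one is full.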