Havoc 6 days ago

A 30B-class model should run on a consumer 24 GB card when quantised, though you'd need a pretty aggressive quant to make room for context. Don't think you'll get the full 256k context though.

So about 700 bucks for a 3090 on eBay

magicalhippo 6 days ago

I have a 5070 Ti and a 2080 Ti, but I'm running Windows, so roughly 25-26 GB of VRAM is available. With Flash Attention enabled, I can just about squeeze in Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL from Unsloth with 64k context entirely on the GPUs.

With a 3090 I guess you'd have to reduce context or go for a slightly more aggressive quantization level.
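The squeeze makes sense on the back of an envelope. This is only a rough sketch: the bits-per-weight figure for Q4_K_XL and the architecture numbers (48 layers, 4 KV heads, head dim 128 for Qwen3-30B-A3B) are assumptions on my part, not figures from the thread:

```python
# Back-of-envelope VRAM estimate for a ~30B MoE model at a Q4-ish quant.
# Architecture numbers below are assumed, not authoritative.
params = 30.5e9
bits_per_weight = 4.5                      # rough average for a Q4_K-family quant
weights_gb = params * bits_per_weight / 8 / 1e9

layers, kv_heads, head_dim = 48, 4, 128    # assumed GQA layout
bytes_per_token = layers * 2 * kv_heads * head_dim * 2   # K+V, fp16
ctx = 64 * 1024
kv_gb = bytes_per_token * ctx / 1e9

print(f"weights ~{weights_gb:.1f} GB + 64k KV cache ~{kv_gb:.1f} GB "
      f"= ~{weights_gb + kv_gb:.1f} GB")
```

That lands in the low-to-mid 20s of GB before compute buffers and overhead, which is why ~25-26 GB across two cards just barely fits and a single 24 GB 3090 needs less context or a smaller quant.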

Summarizing llama-arch.cpp, which is roughly 40k tokens, I get ~50 tok/sec generation speed and ~14 seconds to first token.

For short prompts I get more like ~90 tok/sec and <1 sec to first token.