Remix clone Hacker News

new | show | ask | jobs Github

	▲	nubg a day ago
		how many tokens per second do you get?
	▲	usagisushi 16 hours ago \| parent \| next [-]
		Not the OP, but their setup must be faster than my 4060 16GB + 3060 12GB setup. Here are my numbers (typical values, N=1): `Model pp (t/s) tg (t/s) Qwen 3.6 27B 900 29 Qwen 3.6 35B-A3B 2100 85 Gemma 4 31B 750 28 Gemma 4 26B-A4B 2500 90` - All models: UD-Q4 w/ MTP. Context size: ~100k (MoE) / ~70k (Dense). - Layer splitting used. Tensor splitting is ~1.2x faster in TG, but power spikes from 150W to 380W.
	▲	cybertim a day ago \| parent \| prev [-]
		I bought two RTX3080s with 20GB during my holiday in china (set me back 700euros) I'm getting 800-1000 input tps and 60-100tps output with Qwen 3.6 27b Q8 (MTP, P2P, 200k context) this feels like opus4.5 level while coding (pi harness). Also easy to just host your own openai compatible api from home this way and still use your MacBook as dev station.