I agree. To run an acceptable model (e.g. Qwen/Qwen3.6-27B or google/gemma-4-31B) with a good quantization (minimum Q5) with a good context size (min 64k) you could buy 2 or even 3 GTX 5060 16GiB VRAM for ~550$ each. Fyi the much faster MoE models were useless for my usecases - e.g not able to correctly identify me/I/you, endless thinking loops, etc.

I'm currently running those models using an RTX 5070 12GiB + RTX 5060 16GiB + RTX 3060 12GiB with a 96k context size with MTP/speculative decoding and I'm quite happy (the 5070 is about 4x faster than the 3060, the 5060 is inbetween them so about 2x faster than a 3060).

▲ eklavya a day ago | parent | next [-]

How are you running these together, splitting the model somehow or did you mean different models on any one card at a time?

▲ nubg a day ago | parent | prev [-]

how many tokens per second do you get?

	▲	usagisushi 16 hours ago \| parent \| next [-]
		Not the OP, but their setup must be faster than my 4060 16GB + 3060 12GB setup. Here are my numbers (typical values, N=1): `Model pp (t/s) tg (t/s) Qwen 3.6 27B 900 29 Qwen 3.6 35B-A3B 2100 85 Gemma 4 31B 750 28 Gemma 4 26B-A4B 2500 90` - All models: UD-Q4 w/ MTP. Context size: ~100k (MoE) / ~70k (Dense). - Layer splitting used. Tensor splitting is ~1.2x faster in TG, but power spikes from 150W to 380W.
	▲	cybertim a day ago \| parent \| prev [-]
		I bought two RTX3080s with 20GB during my holiday in china (set me back 700euros) I'm getting 800-1000 input tps and 60-100tps output with Qwen 3.6 27b Q8 (MTP, P2P, 200k context) this feels like opus4.5 level while coding (pi harness). Also easy to just host your own openai compatible api from home this way and still use your MacBook as dev station.