Remix clone Hacker News

	▲	gonzalohm 4 hours ago
		Did you double the tokens per second by adding a second GPU or was the increase significantly less?
	▲	horsawlarway 4 hours ago
		No real change in inference speed. It basically just allows me to slot in more context or a bigger model. A single RTX 3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM. Sometimes that matters, a lot of times it doesn't. On the speed front, MoE models are great: the biggest perf difference in modern models is the move to MoE architectures. I get very similar quality from both the Gemma-4 31B dense model and the Gemma-4 26B MoE model (both at Q4 quant), but the MoE version runs at ~3 times the speed (150 tok/s vs 46 tok/s).
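A rough back-of-the-envelope sketch of the VRAM point above, in Python. The layer count, KV-head count, and head dimension are hypothetical placeholders, not the real Gemma configuration, and the 24 GB per card (with the second card assumed to match the RTX 3090) is an assumption as well. It only counts quantized weights plus an fp16 KV cache and ignores activations and runtime overhead, so treat the numbers as illustrative.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of a standard attention KV cache for one sequence.

    The factor of 2 covers the separate key and value tensors kept per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def weight_bytes(n_params, bits_per_weight):
    """Approximate weight footprint at a given quantization width."""
    return n_params * bits_per_weight // 8


# Hypothetical model shape (NOT taken from any real model card).
cache = kv_cache_bytes(n_layers=32, n_kv_heads=4, head_dim=128, context_len=300_000)

# 26B parameters at roughly 4 bits per weight (Q4, ignoring quantization metadata).
weights = weight_bytes(26_000_000_000, 4)

total = cache + weights
fits_one = total <= 24 * GIB   # one 24 GB card (assumed)
fits_two = total <= 48 * GIB   # two 24 GB cards (assumed)

# Speed ratio quoted in the comment: MoE vs dense.
speedup = 150 / 46

print(f"KV cache: {cache / GIB:.1f} GiB, weights: {weights / GIB:.1f} GiB")
print(f"fits on one card: {fits_one}, fits on two cards: {fits_two}")
print(f"MoE speedup: {speedup:.2f}x")
```

Under these assumed numbers the long context is what pushes the total past a single card, while tok/s is unaffected by the second card, which matches the comment's experience.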
	▲	mirekrusin 4 hours ago
		You're adding an extra GPU for more VRAM, not speed.