dist-epoch 7 hours ago
NVIDIA 5070 Ti can run Gemma 4 26B at 4-bit at 120 tk/s. Arc Pro B70 seems unexpectedly slow? Or are you using 8-bit/16-bit quants?
jchw 6 hours ago | parent
Unfortunately it really is running this slow with llama.cpp, though that's in Vulkan mode. The VRAM capacity is definitely where it shines, rather than compute power. I'm pretty sure this isn't an optimal use of the cards, especially since I believe we should be able to get decent, if still sublinear, scaling with multiple cards. I'm not really a machine learning expert, but I'm curious to see if I can track down some of the performance issues. (I've already seen a couple of issues get squashed since I first started testing this.)

I've heard that vLLM performs much better, particularly in the multi-GPU case. Given that, the 4x B70 setup may actually be decent for the money, but it's probably worth waiting to see how the situation progresses rather than buying on a promise of potential. A cursory Google search suggests that in my particular case interconnect bandwidth shouldn't actually be the constraint, so I suspect tensor-level parallelism isn't working as expected.
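For anyone wanting to reproduce the multi-GPU comparison, a minimal sketch of serving a model with vLLM's tensor parallelism across four cards might look like this. The model ID is a placeholder and the memory-utilization value is just a common default, neither is specified in the thread; only `--tensor-parallel-size` is the flag actually relevant to the scaling question:

```shell
# Hypothetical invocation: shard a quantized model across 4 GPUs.
# --tensor-parallel-size splits each layer's weights across the GPUs,
# which is the mode whose scaling is being questioned above.
vllm serve some-org/some-26b-model-gptq \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90
```

By contrast, llama.cpp's default multi-GPU behavior splits whole layers across cards (pipeline-style), which runs them largely in sequence rather than in parallel, one plausible reason for sublinear scaling there.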