rao-v | 9 hours ago
I doubt you'd get the same sort of result on a modern-ish MoE or dense model via a more standard inference engine like llama.cpp or vLLM. I don't think MLPerf is a reasonable benchmark at this point.

Edit: Here is a simple llama.cpp comparison where the token-generation results match the rule of thumb: https://www.reddit.com/r/LocalLLaMA/comments/1st6lp6/nvidia_...
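For context, a quick sketch of the rule of thumb being referenced, assuming it is the common one for memory-bandwidth-bound decoding (tokens/s ≈ usable memory bandwidth ÷ bytes streamed per token, i.e. the model's weight size for a dense model). The numbers below are illustrative assumptions, not measurements:

```python
# Hedged sketch of the usual LocalLLaMA rule of thumb: single-stream token
# generation is memory-bandwidth bound, so for a dense model that streams all
# of its weights once per token:
#   tokens/s ~= usable memory bandwidth / model size in bytes
# (an MoE model only streams its active experts, so it decodes faster than
# its total parameter count would suggest).

def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough decode-speed estimate; both arguments are assumed values."""
    return bandwidth_gb_s / model_size_gb

# Illustrative: ~900 GB/s of usable bandwidth and a ~4 GB quantized model.
print(est_tokens_per_sec(900.0, 4.0))  # -> 225.0 tokens/s (upper bound)
```

Real-world numbers land below this ceiling because of KV-cache reads, kernel overhead, and imperfect bandwidth utilization, which is roughly why it works as a sanity check rather than a prediction.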