Remix clone Hacker News

	▲	zozbot234 10 hours ago
		Offloading the MoE layers to CPU inference is the easiest way, though it's a bit of a drag on performance.
	▲	ericd 10 hours ago | parent
		Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way. EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.