int_19h 5 days ago

> reflects either training on the test set or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model

That's what MoE is for. With their TPUs, they may be able to afford a very large total parameter count, so long as the subset of experts activated for each token is small enough to maintain throughput.
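The point can be made concrete with a toy sketch of top-k MoE routing. All sizes and names here are illustrative, not Gemini's actual architecture: the model stores weights for every expert, but each token only pays the compute cost of the few experts its router selects.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, num_experts, top_k = 64, 8, 2

# Router scores tokens against experts; each expert is a plain linear layer here.
router = rng.standard_normal((d_model, num_experts))
experts = rng.standard_normal((num_experts, d_model, d_model))

def moe_forward(x):
    """Route token x to its top_k experts; only those experts actually run."""
    logits = x @ router                    # (num_experts,) router scores
    top = np.argsort(logits)[-top_k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Only top_k of num_experts expert matmuls execute for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(d_model))

total_params = experts.size                # parametric knowledge held in memory
active_params = top_k * d_model * d_model  # compute actually spent per token
print(f"total expert params: {total_params}, active per token: {active_params}")
```

With 8 experts and top-2 routing, only a quarter of the expert parameters are touched per token, which is the throughput trade the comment describes.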