int_19h 5 days ago

> reflects either training on the test set or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model

That's what MoE is for. With their TPUs, they may be able to afford a very large total parameter count, so long as the subset of experts activated for each token is small enough to maintain throughput.
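The point can be made concrete with a toy sketch of top-k MoE routing. All sizes and names here are illustrative, not Gemini's actual architecture: the model stores weights for every expert, but each token only pays the compute cost of the few experts its router selects.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, num_experts, top_k = 64, 8, 2

# Router scores tokens against experts; each expert is a plain linear layer here.
router = rng.standard_normal((d_model, num_experts))
experts = rng.standard_normal((num_experts, d_model, d_model))

def moe_forward(x):
    """Route token x to its top_k experts; only those experts actually run."""
    logits = x @ router                    # (num_experts,) router scores
    top = np.argsort(logits)[-top_k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Only top_k of num_experts expert matmuls execute for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(d_model))

total_params = experts.size                # parametric knowledge held in memory
active_params = top_k * d_model * d_model  # compute actually spent per token
print(f"total expert params: {total_params}, active per token: {active_params}")
```

With 8 experts and top-2 routing, only a quarter of the expert parameters are touched per token, which is the throughput trade the comment describes.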