zozbot234 2 hours ago

Sharding the model is really slow. The whole point of building a wafer-scale chip is that on-chip memory bandwidth is far higher than you get even from chiplets on an interposer with a high-bandwidth connection, let alone from going off-chip. You're giving up your whole advantage, especially since Cerebras clearly isn't trying to maximize total throughput per watt - Groq, TPUs, and even the latest Nvidia solutions are preferable there.

yorwba an hour ago | parent

There are ways to shard the model that require a lot of off-chip bandwidth, but there are also ways that don't. The only data that needs to be passed between layers is the residual stream, which requires much less bandwidth than the layer weights and KV cache, and you already need about that much bandwidth to get input tokens in and output tokens out. So putting different layers on different chips isn't that terrible.
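A rough back-of-the-envelope sketch of that comparison. The dimensions below are made up for illustration (a generically large dense transformer), not anything Cerebras has published, but the ratio is what matters: the tensor that crosses a chip boundary per token is tiny next to the data that stays put.

    # Illustrative sizes only (hypothetical d_model, fp16 everywhere):
    # what crosses a chip boundary per token vs. what stays on-chip per layer
    # when consecutive layers are placed on different chips.

    d_model = 8192           # width of the residual stream
    bytes_per_val = 2        # fp16 / bf16

    # Crosses the boundary, once per token per chip-to-chip hop:
    residual_per_token = d_model * bytes_per_val               # 16 KB

    # Stays on-chip, per layer:
    weights_per_layer = 12 * d_model ** 2 * bytes_per_val      # ~1.5 GiB (attn + MLP)
    kv_per_token_per_layer = 2 * d_model * bytes_per_val       # 32 KB (no GQA assumed)

    print(f"residual stream per token crossing a boundary: {residual_per_token / 1024:.0f} KB")
    print(f"weights per layer (never move):                {weights_per_layer / 1024**3:.1f} GiB")
    print(f"KV cache per token per layer (never moves):    {kv_per_token_per_layer / 1024:.0f} KB")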

Importantly, Cerebras is offering many models that can't possibly fit on just a single chip, so they have to use some kind of sharding to get them to work at all. You could imagine an even bigger chip that can fit the entire model and run it even faster, but they have to work with what can be manufactured with current technology.
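For a sense of scale, some illustrative arithmetic (the ~44 GB on-wafer SRAM figure is the commonly cited WSE-3 spec; the 405B model is just a stand-in for "a big dense model", not a specific Cerebras offering):

    import math

    params = 405e9            # generic 405B-parameter dense model
    bytes_per_param = 2       # fp16 / bf16 weights
    sram_per_wafer_gb = 44    # reported on-chip SRAM of a WSE-3

    weights_gb = params * bytes_per_param / 1e9            # ~810 GB of weights
    wafers = math.ceil(weights_gb / sram_per_wafer_gb)     # 19, for weights alone

    print(f"{weights_gb:.0f} GB of weights -> at least {wafers} wafers, before any KV cache")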