yorwba 2 hours ago

There are ways to shard the model that require a lot of off-chip bandwidth, but there are also ways that don't. The only data that needs to be passed between layers is the residual stream, which requires much less bandwidth than the layer weights and KV cache, and you already need about that much bandwidth to get input tokens in and output tokens out. So putting different layers on different chips isn't that terrible.
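To make the bandwidth comparison concrete, here is a back-of-the-envelope sketch in Python with assumed, roughly Llama-70B-like dimensions (hidden size, layer count, and parameter count per layer are all illustrative guesses, not Cerebras's actual configuration):

    # Back-of-the-envelope comparison: what crosses a chip boundary per token
    # (the residual stream) vs. what stays resident on the chip (layer weights).
    # All dimensions are assumed, roughly Llama-70B-like; bf16 = 2 bytes/value.

    BYTES_PER_VALUE = 2                       # bf16
    HIDDEN_SIZE = 8192                        # width of the residual stream
    NUM_LAYERS = 80
    PARAMS_PER_LAYER = 12 * HIDDEN_SIZE**2    # rough attention + MLP parameter count

    # Data that must cross between chips for each token at a layer boundary:
    residual_bytes_per_token = HIDDEN_SIZE * BYTES_PER_VALUE        # ~16 KB

    # Data that would have to move if weights were streamed instead:
    weights_bytes_per_layer = PARAMS_PER_LAYER * BYTES_PER_VALUE    # ~1.6 GB

    print(f"residual stream per token: {residual_bytes_per_token / 1e3:.1f} KB")
    print(f"weights per layer:         {weights_bytes_per_layer / 1e9:.2f} GB")
    print(f"weights are ~{weights_bytes_per_layer / residual_bytes_per_token:,.0f}x larger")

With these assumed numbers, handing the residual stream to the next chip costs tens of kilobytes per token, while a single layer's weights run to gigabytes, which is why keeping weights resident and only shipping activations is the cheap direction.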

Importantly, Cerebras is offering many models that can't possibly fit on just a single chip, so they have to use some kind of sharding to get them to work at all. You could imagine an even bigger chip that can fit the entire model and run it even faster, but they have to work with what can be manufactured with current technology.
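For anyone who wants to see the layer-sharding idea spelled out, here is a minimal pipeline-style sketch in Python/NumPy. It is only an illustration of the general scheme (contiguous blocks of layers per chip, with just the residual stream crossing boundaries), not a description of how Cerebras actually partitions their hardware:

    import numpy as np

    # Minimal sketch of pipeline-style layer sharding: each "chip" holds a
    # contiguous block of layers, and only the residual stream (one vector
    # per token here) is handed from one chip to the next.

    HIDDEN_SIZE = 256
    NUM_LAYERS = 8
    NUM_CHIPS = 4

    rng = np.random.default_rng(0)

    # Each layer is just a weight matrix for illustration.
    layers = [rng.standard_normal((HIDDEN_SIZE, HIDDEN_SIZE)) * 0.01
              for _ in range(NUM_LAYERS)]

    # Assign contiguous groups of layers to chips.
    per_chip = NUM_LAYERS // NUM_CHIPS
    chips = [layers[i * per_chip:(i + 1) * per_chip] for i in range(NUM_CHIPS)]

    def run_chip(chip_layers, residual):
        """Run the layers resident on one chip; weights never leave the chip."""
        for w in chip_layers:
            residual = residual + np.tanh(residual @ w)   # toy residual block
        return residual

    x = rng.standard_normal(HIDDEN_SIZE)
    for chip_layers in chips:
        # Only this HIDDEN_SIZE-sized vector crosses the chip boundary.
        x = run_chip(chip_layers, x)

    print("output norm:", np.linalg.norm(x))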