| ▲ | yorwba 5 hours ago |
| > The idea is to have a chip with SRAM large enough to fit the entire model, so inference can happen entirely in-memory. [...] So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex. You don't really need to fit the entire model on a single chip. Just as with GPUs, you can shard the model across multiple chips. Of course, when you have a long pipeline of chips that each token needs to pass through, that decreases the end-to-end tokens per second correspondingly. So the size of GPT-5.3-Codex-Spark isn't limited by the memory of a single Cerebras chip, but by the number of such chips you can chain together and still hit the 1000 tokens per second target. Given that Cerebras offers models much larger than 40B at faster speeds (https://www.cerebras.ai/pricing#exploration), GPT-5.3-Codex-Spark is likely closer to GLM 4.7 in size (≈355B total parameters, 32B active). |
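A back-of-the-envelope sketch of that sizing argument (weights only; KV cache, activations, and per-chip overhead are ignored, and the 355B figure is just the GLM-scale guess from above):

    # How many parameters fit in 44 GB of on-chip SRAM?
    SRAM_BYTES = 44e9

    def max_params(bytes_per_param):
        return SRAM_BYTES / bytes_per_param

    print(f"fp16: ~{max_params(2) / 1e9:.0f}B params")  # ~22B
    print(f"int8: ~{max_params(1) / 1e9:.0f}B params")  # ~44B

    # A GLM-4.x-scale model (~355B total parameters) at int8 would need several chips:
    print(f"chips at int8: ~{355e9 / SRAM_BYTES:.1f}")  # ~8.1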
|
| ▲ | zozbot234 2 hours ago | parent | next [-] |
| Sharding the model is really slow. The point of building a wafer-scale chip is that on-chip memory bandwidth is far higher than anything you can get even from chiplets on an interposer with a high-bandwidth connection, let alone from going off-chip. By sharding you're giving up your whole advantage, especially since Cerebras clearly isn't trying to maximize total throughput per watt - Groq, TPUs, and even the latest Nvidia solutions are preferable there. |
| |
| ▲ | yorwba an hour ago | parent [-] | | There are ways to shard the model that require a lot of off-chip bandwidth, but there are also ways that don't. The only data that needs to be passed between layers is the residual stream, which requires much less bandwidth than the layer weights and KV cache, and you already need about that much bandwidth to get input tokens in and output tokens out. So putting different layers on different chips isn't that terrible. Importantly, Cerebras is offering many models that can't possibly fit on just a single chip, so they have to use some kind of sharding to get them to work at all. You could imagine an even bigger chip that can fit the entire model and run it even faster, but they have to work with what can be manufactured with current technology. |
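A minimal sketch of that layer-wise (pipeline) sharding idea, as a NumPy toy with tiny made-up sizes rather than anything Cerebras actually runs: each "chip" keeps its block of layer weights resident, and the only thing that crosses a chip boundary per token is the residual stream.

    import numpy as np

    HIDDEN = 512            # tiny illustrative width; DeepSeek-V3.2's is 7168
    LAYERS_PER_CHIP = 4
    N_CHIPS = 4

    class Chip:
        def __init__(self, seed):
            rng = np.random.default_rng(seed)
            # Weights stay on this chip for the whole run; they are never transferred.
            self.layers = [0.01 * rng.standard_normal((HIDDEN, HIDDEN))
                           for _ in range(LAYERS_PER_CHIP)]

        def forward(self, x):
            for w in self.layers:
                x = x + np.tanh(x @ w)   # stand-in for one transformer block
            return x

    pipeline = [Chip(seed) for seed in range(N_CHIPS)]

    def decode_one_token(x):
        for chip in pipeline:
            x = chip.forward(x)          # the only off-chip transfer: HIDDEN floats
        return x

    x = decode_one_token(np.ones(HIDDEN))
    print(x.shape)                                           # (512,)
    print(f"per-hop transfer at the real 7168 width (fp16): {7168 * 2 / 1024:.0f} KiB")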
|
|
| ▲ | amelius 4 hours ago | parent | prev | next [-] |
| > Of course when you have a long pipeline of chips that each token needs to pass through, that decreases the end-to-end tokens per second correspondingly. No, it only increases the latency, and does not affect the throughput. |
| |
| ▲ | EdNutting 4 hours ago | parent | next [-] | | It affects both. These systems are vastly more complex than the naive mental models being discussed in these comments. For one thing, going chip-to-chip is not a faultless process and does not operate at the same speed as on-chip communication. So, yes, throughput can be reduced by splitting a computation across two chips of otherwise equal speed. | |
| ▲ | qudent 4 hours ago | parent | prev [-] | | It does affect the throughput for an individual user, because you need all output tokens up to n to generate output token n+1. | | |
| ▲ | EdNutting 4 hours ago | parent [-] | | :facepalm: - That’s not how that works. | | |
| ▲ | qudent 2 hours ago | parent | next [-] | | Because inference is autoregressive (token n is an input for predicting token n+1), the forward pass for token n+1 cannot start until token n is complete. For a single stream, throughput is the inverse of latency (T = 1/L). Consequently, any increase in latency for the next token directly reduces the tokens/sec for the individual user. | |
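A toy illustration of the T = 1/L point (the sleep stands in for one full forward pass through the pipeline; the numbers are arbitrary):

    import time

    def single_stream_decode(n_tokens, per_token_latency_s):
        """Autoregressive decode: step i+1 can't start until step i has produced its token."""
        start = time.perf_counter()
        prev_token = 0
        for _ in range(n_tokens):
            time.sleep(per_token_latency_s)   # one pass through the whole chip pipeline
            prev_token += 1                   # this token is an input to the next step
        return n_tokens / (time.perf_counter() - start)

    # Doubling per-token latency halves single-user tokens/sec,
    # regardless of how many independent requests the hardware could batch.
    print(f"{single_stream_decode(20, 0.01):.0f} tok/s")   # ~100
    print(f"{single_stream_decode(20, 0.02):.0f} tok/s")   # ~50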
| ▲ | catoc 3 hours ago | parent | prev | next [-] | | Your comment may be helpful - but it would be much more helpful if you shared how it does work. Edit: I see you're doing this further down; #thumbs up | |
| ▲ | littlestymaar 3 hours ago | parent | prev [-] | | How do you think that works?! With the exception of diffusion language models (which don't work this way, but are very niche), language models are autoregressive, which means you indeed need to process tokens in order. And that's why model speed is such a big deal: you can't just throw more hardware at the problem, because the problem is latency, not compute. |
|
|
|
|
| ▲ | johndough 4 hours ago | parent | prev [-] |
| > So the size of GPT-5.3-Codex-Spark isn't limited by the memory of a single Cerebras chip, but by the number of such chips you can chain together and still hit the 1000 tokens per second target. Chaining chips does not decrease token throughput. In theory, you could run models of any size on Cerebras chips. See for example Groq's (not to be confused with Grok) chips, which have only 230 MB of SRAM per chip, yet manage to run Kimi K2. |
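Rough arithmetic behind that Groq example (Kimi K2's ~1T total parameter count is public; the quantization choices and the neat division are simplifying assumptions, since real deployments also replicate weights and hold KV cache):

    sram_per_chip = 230e6            # bytes of on-chip SRAM per Groq chip
    kimi_k2_total_params = 1.0e12    # ~1T-total-parameter MoE (~32B active)

    for bits, name in [(8, "int8"), (4, "int4")]:
        model_bytes = kimi_k2_total_params * bits / 8
        print(f"{name}: ~{model_bytes / sram_per_chip:,.0f} chips")
    # int8: ~4,348 chips; int4: ~2,174 chips -- hence racks full of chips,
    # with each token's activations streaming through the chain.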
| |
| ▲ | EdNutting 4 hours ago | parent [-] | | Only if chip-to-chip communication is as fast as on-chip communication. Which it isn’t. | | |
| ▲ | johndough 3 hours ago | parent | next [-] | | Only if chip-to-chip communication were a bottleneck. Which it isn't. If a layer completely fits in SRAM (as is probably the case for Cerebras), you only have to communicate the hidden states between chips for each token. The hidden states are very small (7168 floats for DeepSeek-V3.2, https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/c... ), which won't be a bottleneck. Things get more complicated if a layer does not fit in SRAM, but it still works out fine in the end. | |
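The bandwidth math behind that claim, assuming fp16 activations and the 1000 tokens/s target from upthread:

    hidden_size = 7168        # DeepSeek-V3.2 hidden size, per the linked config
    bytes_per_value = 2       # assuming fp16/bf16 activations
    tokens_per_sec = 1000     # the speed target discussed upthread

    per_token = hidden_size * bytes_per_value     # bytes crossing each chip boundary per token
    per_second = per_token * tokens_per_sec
    print(f"{per_token / 1024:.0f} KiB per token, {per_second / 1e6:.1f} MB/s per link")
    # ~14 KiB/token and ~14 MB/s -- tiny next to the TB/s you'd need to stream
    # the weights themselves, which is why keeping weights in SRAM is the whole trick.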
| ▲ | littlestymaar 3 hours ago | parent | prev [-] | | It doesn't need to be: during inference there's little data exchange between one chip and another (just a single embedding vector per token). It's completely different during training, because the backward pass and weight updates put a lot of strain on inter-chip communication, but during inference even a x4 PCIe 4.0 link is enough to connect GPUs together without losing speed. |
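For a sense of scale on the PCIe point (nominal link rates, ignoring protocol overhead):

    # Nominal PCIe 4.0: 16 GT/s per lane with 128b/130b encoding, per direction.
    lane_bytes_per_s = 16e9 * 128 / 130 / 8        # ~1.97 GB/s per lane
    x4_bytes_per_s = 4 * lane_bytes_per_s          # ~7.9 GB/s
    embedding_bytes = 7168 * 2                     # one fp16 embedding vector per token
    print(f"~{x4_bytes_per_s / embedding_bytes:,.0f} tokens/s before a x4 link saturates")
    # ~550,000 tokens/s -- orders of magnitude more than a single inference stream needs.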
|
|