aurareturn 4 hours ago
It does make sense. Nvidia chips do not promise 1,000+ tokens/s. The 80GB on an Nvidia card is external HBM, unlike Cerebras' 44GB of on-wafer SRAM. The whole reason Cerebras can serve a model at thousands of tokens per second is that it hosts the entire model in SRAM. There are two possible scenarios for Codex Spark:

1. OpenAI designed a model to fit in exactly 44GB.

2. OpenAI designed a model that requires Cerebras to chain multiple wafer chips together, i.e., an 88GB, 132GB, or 176GB model or more.

Both options require the entire model to fit inside SRAM.
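A rough back-of-envelope sketch of that second scenario (the 44GB-per-wafer figure is from above; the parameter counts and precisions are made up for illustration, not Codex Spark's actual size or Cerebras' real packing):

    import math

    # Assumed: ~44GB of usable on-wafer SRAM per system, dense weights.
    WAFER_SRAM_GB = 44

    def wafers_needed(params_billions: float, bytes_per_param: float):
        # 1B params at N bytes/param is roughly N GB of weights
        weight_gb = params_billions * bytes_per_param
        return weight_gb, math.ceil(weight_gb / WAFER_SRAM_GB)

    for params, bpp in [(20, 2.0), (40, 1.0), (70, 2.0)]:
        gb, n = wafers_needed(params, bpp)
        print(f"{params}B params @ {bpp} B/param -> ~{gb:.0f} GB weights, {n} wafer(s)")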
woadwarrior01 3 hours ago
Let's not forget the KV-cache, which also needs a lot of memory (though not as much as the model weights) and scales linearly with sequence length.
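A quick sketch of that linear scaling, using a made-up transformer shape (layer count, KV heads, head dim are illustrative, not Codex Spark's actual configuration):

    # KV-cache size for a decoder-only transformer: 2x (K and V) per layer, per KV head
    def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

    for seq_len in (8_192, 32_768, 131_072):
        print(f"{seq_len:>7} tokens: ~{kv_cache_gb(48, 8, 128, seq_len):.1f} GB per sequence")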