pama | a day ago
> > V3/R1 scale models as a baseline, one can produce 720,000 tokens
>
> On what hardware? At how many tokens per second? But most importantly, at what quality?

The hardware is NVIDIA's GB200 NVL72. This applies to the class of 671B-parameter DeepSeek models, e.g. R1-0528 or V3, in their full-accuracy setup (i.e. reproducing the quality of the reported DeepSeek benchmarks).

Here is the writeup (by humans; the second figure shows tokens per second per GPU as a function of batch size, which highlights the advantage of centralized decoding over current at-home hacks): https://lmsys.org/blog/2025-06-16-gb200-part-1/

And here are the instructions to replicate the particular benchmark: https://github.com/sgl-project/sglang/issues/7227

The LLM text I linked in my original answer carries out the math using the energy consumption of the NVIDIA hardware setup (120 kW) and rather simple arithmetic, which you can reproduce.
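For what it's worth, the arithmetic is easy to sketch. Assuming the 720,000 figure refers to aggregate tokens per second for the whole NVL72 rack at its 120 kW draw (my reading of the quote, not a number I'm asserting independently), the energy cost per token falls out directly:

```python
# Back-of-envelope energy-per-token arithmetic (assumed inputs, not measured by me):
# - rack power draw: 120 kW (from the comment above)
# - aggregate decode throughput: 720,000 tokens/s across the rack (assumed interpretation)

rack_power_w = 120_000          # watts
throughput_tok_s = 720_000      # tokens per second, whole rack

joules_per_token = rack_power_w / throughput_tok_s   # W / (tok/s) = J/tok
kwh_per_million_tokens = joules_per_token * 1_000_000 / 3_600_000  # 1 kWh = 3.6e6 J

print(f"{joules_per_token:.4f} J per token")
print(f"{kwh_per_million_tokens:.4f} kWh per million tokens")
```

Under those assumptions this comes out to roughly 1/6 J per token, i.e. a few hundredths of a kWh per million tokens; if the 720,000 figure means something else (per minute, per GPU, etc.), scale accordingly.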