yorwba 16 hours ago

Right, but they didn't use rented GPUs, so it's a purely notional figure. It's an appropriate value for comparison to other single training runs (e.g. it tells you that turning DeepSeek-V3 into DeepSeek-R1 cost much less than training DeepSeek-V3 from scratch), but not as an estimate of the entire budget of a company training LLMs.

DeepSeek spent a large amount upfront to build a cluster they can run lots of small experiments on over the course of several years. If you only count the successful runs, their costs look much lower than they actually are end-to-end.
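
Back-of-envelope, the notional per-run figure works out like this (the 2.788M GPU-hours and the $2/hour H800 rental rate are DeepSeek's own stated assumptions in the V3 technical report, not an actual bill):

    # Notional single-run cost for DeepSeek-V3, using the figures from its technical report
    gpu_hours = 2_788_000   # total H800 GPU-hours for the full V3 training run
    rental_rate = 2.00      # assumed USD per H800 GPU-hour (their assumption, not a real invoice)
    print(f"${gpu_hours * rental_rate / 1e6:.3f}M")   # ~$5.576M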

yunohn 14 hours ago | parent [-]

No, they’re saying that training a model, specifically DeepSeek, costs $X using N hours of rented Y GPUs.

yorwba 6 hours ago | parent [-]

If by "they" you mean DeepSeek, they're not saying this, since you might not actually be able to rent a cluster of 512 H800s wired together with high-bandwidth interconnects at that GPU-hour price point. If you rent smaller groups of GPUs piecemeal in different locations and try to transfer weight updates between them over the internet, it'll kill your throughput.