paulddraper, a day ago:
Note this also assumes you (1) rent your GPUs, (2) pay list price with no volume discounts, and (3) get only 85 tokens/sec. Realistically, frontier models attain 200+ tokens/second amortized. Inference is extremely profitable at scale.
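To see why the per-stream token rate matters, here's a minimal sketch of the rental economics; the $2.50/hour H100 on-demand price is a hypothetical stand-in, not a figure from this thread:

    # Sketch: cost to generate tokens on a rented GPU at different rates.
    # The hourly rental price below is an assumption for illustration only.
    HOURLY_RENTAL_USD = 2.50  # hypothetical on-demand H100 price

    def cost_per_million_tokens(tokens_per_sec: float) -> float:
        """USD cost to generate 1M tokens at a given sustained rate."""
        tokens_per_hour = tokens_per_sec * 3600
        return HOURLY_RENTAL_USD / tokens_per_hour * 1_000_000

    for rate in (85, 200):
        print(f"{rate} tok/s -> ${cost_per_million_tokens(rate):.2f}/M tokens")
    # 85 tok/s  -> ~$8.17/M tokens
    # 200 tok/s -> ~$3.47/M tokens

Under these assumptions, moving from 85 to 200 tokens/sec cuts the cost per token by more than half, which is the point about amortized throughput.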
aurareturn, a day ago (in reply):
Assuming an 80GB H100 running inference on an MoE model close to the size of its 80GB of VRAM (Mixtral 8x7B is an example), you're going to see around 10k tokens/second fully batched and saturated. That's about 36 million tokens/hour. Mixtral 8x7B on OpenRouter costs $0.54/M input tokens and $0.54/M output tokens, so with a roughly equal volume of input and output tokens you're looking at potentially $38.88/hour of return on that H100. And this is probably the best-case scenario: in reality, inference providers will use multiple GPUs together to run bigger, smarter models at a higher price.
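A quick sketch of that arithmetic; the batched throughput and OpenRouter prices come from the comment above, while the 1:1 input/output token split is an assumption needed to reach the $38.88 figure:

    # Revenue math for a fully batched H100 serving a Mixtral-8x7B-class MoE.
    BATCHED_TOKENS_PER_SEC = 10_000  # fully saturated, from the comment above
    INPUT_PRICE_PER_M = 0.54         # USD per 1M input tokens (OpenRouter)
    OUTPUT_PRICE_PER_M = 0.54        # USD per 1M output tokens (OpenRouter)

    output_tokens_per_hour = BATCHED_TOKENS_PER_SEC * 3600  # 36M tokens/hour
    input_tokens_per_hour = output_tokens_per_hour          # assumed 1:1 ratio

    revenue_per_hour = (
        output_tokens_per_hour / 1e6 * OUTPUT_PRICE_PER_M
        + input_tokens_per_hour / 1e6 * INPUT_PRICE_PER_M
    )
    print(f"${revenue_per_hour:.2f}/hour")  # $38.88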