paulddraper, a day ago:
Note this also assumes you (1) rent your GPUs, (2) pay list price with no volume discounts, and (3) get only 85 tokens/sec. Realistically, frontier models attain 200+ tokens/second amortized. Inference is extremely profitable at scale.
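To see why the per-stream token rate matters, here's a minimal sketch of the rental economics; the $2.50/hour H100 on-demand price is a hypothetical stand-in, not a figure from this thread:

    # Sketch: cost to generate tokens on a rented GPU at different rates.
    # The hourly rental price below is an assumption for illustration only.
    HOURLY_RENTAL_USD = 2.50  # hypothetical on-demand H100 price

    def cost_per_million_tokens(tokens_per_sec: float) -> float:
        """USD cost to generate 1M tokens at a given sustained rate."""
        tokens_per_hour = tokens_per_sec * 3600
        return HOURLY_RENTAL_USD / tokens_per_hour * 1_000_000

    for rate in (85, 200):
        print(f"{rate} tok/s -> ${cost_per_million_tokens(rate):.2f}/M tokens")
    # 85 tok/s  -> ~$8.17/M tokens
    # 200 tok/s -> ~$3.47/M tokens

Under these assumptions, moving from 85 to 200 tokens/sec cuts the cost per token by more than half, which is the point about amortized throughput.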
aurareturn, a day ago (in reply):
Assuming an 80GB H100 running inference on an MoE model close to the size of its 80GB of VRAM (Mixtral 8x7B is an example), you're going to see around 10k tokens/second fully batched and saturated. That's about 36 million tokens/hour. Mixtral 8x7B on OpenRouter costs $0.54/M input tokens and $0.54/M output tokens, so with a roughly equal volume of input and output tokens you're looking at potentially $38.88/hour of return on that H100. And this is probably the best-case scenario: in reality, inference providers will use multiple GPUs together to run bigger, smarter models at a higher price.
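A quick sketch of that arithmetic; the batched throughput and OpenRouter prices come from the comment above, while the 1:1 input/output token split is an assumption needed to reach the $38.88 figure:

    # Revenue math for a fully batched H100 serving a Mixtral-8x7B-class MoE.
    BATCHED_TOKENS_PER_SEC = 10_000  # fully saturated, from the comment above
    INPUT_PRICE_PER_M = 0.54         # USD per 1M input tokens (OpenRouter)
    OUTPUT_PRICE_PER_M = 0.54        # USD per 1M output tokens (OpenRouter)

    output_tokens_per_hour = BATCHED_TOKENS_PER_SEC * 3600  # 36M tokens/hour
    input_tokens_per_hour = output_tokens_per_hour          # assumed 1:1 ratio

    revenue_per_hour = (
        output_tokens_per_hour / 1e6 * OUTPUT_PRICE_PER_M
        + input_tokens_per_hour / 1e6 * INPUT_PRICE_PER_M
    )
    print(f"${revenue_per_hour:.2f}/hour")  # $38.88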