| ▲ | boriskourt a day ago |
| > The cost to serve tokens is absolutely profitable today and that’s been true for at least a year. > For the data center build outs, demand for tokens is still exceeding supply. Can you provide any numbers for this? |
|
| ▲ | wongarsu a day ago | parent | next [-] |
I can get Kimi K2.5 inference on OpenRouter for about $0.5/MTok input + $2.5/MTok output, from six providers that have no moat besides efficiently selling GPU time. We can assume they are doing so at a profit (they have no incentive to do this at a loss), giving us those numbers as an upper bound on the cost to serve a 1T-a32b model at scale. Now we don't know the true size of any of the proprietary models, but my educated guess is that Sonnet is in about the same parameter range, just with better training and much better fine-tuning and RLHF. Yet API pricing for Sonnet is $3/MTok input + $15/MTok output, exactly six times as expensive. Even Haiku is twice as expensive as Kimi K2.5. I find it difficult to believe in a world where those API prices aren't profitable. For subscription pricing it's harder to tell. We hear about those who get insane value out of their subscription, but there has to be a large mass of users who never reach their limits. With company-wide rollouts there might even be a lot of subscription users who consume virtually no tokens at all. |
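The markup in that comment can be checked directly (a quick sketch; these are the public rate-card prices quoted above, not measured serving costs):

```python
# Public API prices quoted above, in $/MTok (rate cards, not measured costs)
kimi_in, kimi_out = 0.5, 2.5        # Kimi K2.5 via OpenRouter providers
sonnet_in, sonnet_out = 3.0, 15.0   # Claude Sonnet API pricing

input_markup = sonnet_in / kimi_in    # Sonnet's markup per input token
output_markup = sonnet_out / kimi_out # ...and per output token

print(input_markup, output_markup)  # 6.0 6.0 — "exactly six times as expensive"
```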
| |
| ▲ | yobbo a day ago | parent | next [-] | | > We can assume they are doing so at a profit This is false. We may assume it's the most efficient way of generating revenue given their GPUs, but their overall profitability will just be a guess. They would still have incentives to run hardware at maximum, even when it's uncertain to eventually recoup costs. > a world where those API prices aren't profitable A lab with employees and models in training has other costs than the operating expenses of a GPU farm. | | |
| ▲ | aurareturn a day ago | parent | next [-] | | Why would a company sell inference on OpenRouter if they're not profitable? Except for Groq/Cerebras and a few other hardware companies looking to showcase their new chips. If they're losing money and have no VC backing, they'd just turn off the lights. | |
| ▲ | financltravsty a day ago | parent | prev [-] | | The actual inference is operated at a 95%+ margin. |
| |
| ▲ | FiberBundle a day ago | parent | prev | next [-] | | This is like saying that innovative medical drugs could be sold at a profit if only there were no patent protection and the innovating companies would still invest in R&D. Yes, on a token level pure inference costs might be profitable, but the frontier AI labs will surely have to recoup their R&D investments at some point. | |
| ▲ | jerojero a day ago | parent | prev | next [-] | | Companies doing foundational models need to cover the cost of training which is much more expensive than training something like kimi. | | |
| ▲ | wongarsu a day ago | parent | next [-] | | Yes. I would not consider Kimi a particularly good model relative to its size, and making a SotA model is a lot more expensive. But training costs are explicitly excluded when talking about the cost to serve tokens | |
| ▲ | gruez a day ago | parent | prev [-] | | >Companies doing foundational models need to cover the cost of training [...] But that's moving the goalposts? The original claim was on inference itself, not the whole company. > The cost to serve tokens is absolutely profitable today and that’s been true for at least a year. | | |
| ▲ | lbreakjai a day ago | parent [-] | | But that's the same as thinking "This bar is selling a cocktail for $15. I could make it at home for 30 cents. They're making $14.70 of profit per cocktail, the owner must be a millionaire now!" Everything is profitable if you ignore the costs. |
|
| |
| ▲ | ZitchDog a day ago | parent | prev | next [-] | | > they have no incentive to do this at a loss Are you sure? Surely there is a lot of interesting data in those LLM interactions. | | |
| ▲ | wongarsu a day ago | parent [-] | | Many of them are promising not to store any of this. Of course we have to trust them, for all we know they are all funded by various spy agencies |
| |
| ▲ | KallDrexx a day ago | parent | prev [-] | | The problem I have with this analysis is it's missing the multi-dimensional aspect of "is this profitable". It's fair to say that if all these operators are competing for tokens, the OpenRouter token operators (not sure of the exact phrase, but the people running the models) are accounting for some level of margin. However, how many of these are running their own data centers and GPUs? If they are running their own infrastructure, then it's not a simple equation of whether each specific token set is profitable, since it needs to account for the cost of running the data center. It could be that they believe it is profitable in the long term by utilizing the long tail of asset depreciation, but that isn't guaranteed. If they aren't running their own infrastructure, then it's much easier to claim that it's profitable and has a margin (outside of running their servers to manage the rented infrastructure). However, a lot of data centers have some pretty crazy low prices for GPUs that may be vying for user base and revenue over profitability. In these cases, if data center growth starts slowing due to slower buildout, then it's very likely GPU prices go up and inference stops being profitable for the OpenRouter operators. So long term it's not clear how profitable even these open models are. OpenAI and Anthropic definitely fall into the latter category too. Their infrastructure requirements are much higher than the open models', and they are being given huge discounts so Microsoft/Amazon/Google can all claim revenue (since they have profitability coming from other parts). It's not clear if OpenAI and Anthropic models would be profitable at inference if they were paying rates that cloud hosts would make a profit from. There are just way too many dimensions to this scenario to flat out state that OpenRouter proves inference is profitable at scale. |
|
|
| ▲ | ACCount37 a day ago | parent | prev | next [-] |
Check the token prices for open weight LLMs at various independent inference providers. That gives you a very good estimate of how cheaply a model of size N can be served while still making a profit. Now, keep in mind: Kimi K2.5 is 1T MoE. Today's frontier LLMs are in the 1T to 5T range, also MoE. Make an estimate. Compare that estimate with the actual frontier lab prices. |
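One way to run that estimate (a deliberately crude sketch; the assumptions are mine, not the commenter's: open-market prices approximate at-scale serving cost, and serving cost scales at most linearly with total size for MoE models of similar design):

```python
# Observed open-market output price for a 1T MoE (Kimi K2.5, per upthread)
open_weight_price = 2.5                    # $/MTok

# Frontier models are "1T to 5T" per the comment; scale the cost accordingly
size_factor_low, size_factor_high = 1, 5
cost_estimate_low = open_weight_price * size_factor_low    # ~$2.5/MTok
cost_estimate_high = open_weight_price * size_factor_high  # ~$12.5/MTok

frontier_price = 15.0                      # Sonnet output $/MTok, per upthread
# Even the pessimistic end of the range sits below frontier list pricing.
```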
| |
| ▲ | lolc a day ago | parent [-] | | I don't think it's as easy as looking at open weight API prices. We don't know whether the operators are making a profit on all the hardware they bought. Maybe the prices we pay just cover electricity. And it's not even certain that running costs are covered by API prices: the operators may be siphoning content and subsidizing operations by selling it. In the current volatile environment, the API prices are more of a baseline where we can assume it can't be much cheaper to operate these models. | | |
| ▲ | aurareturn a day ago | parent [-] | | That doesn't make sense in this environment because everyone is compute constrained with huge backlogs they can't fulfill. If these inference providers aren't making any money, they'd simply sell their GPUs to those who are starved for compute. |
|
|
|
| ▲ | bob1029 a day ago | parent | prev | next [-] |
| https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-dgx-b20... |
|
| ▲ | infecto a day ago | parent | prev | next [-] |
Most/all private labs have stated that inference is profitable. This was happening before the large push to scrap flat-rate plans and largely charge folks the underlying API rates. Second, take a look at the pricing of open models. Certainly it's not a direct 1:1 comparison, but we can use it as a baseline. Of course folks might not be telling the truth, but this is one of those situations where I see too many markers on the true side. For supply, look at outages and growth rates at companies like OpenRouter. The demand is growing every week. |
|
| ▲ | paulddraper a day ago | parent | prev | next [-] |
| Anthropic has said inference is profitable. That’s a biased source, but the math pencils. This is why switching to local open weight models saves a lot of money. (Even though it’s not apples to apples.) |
| |
| ▲ | drakythe a day ago | parent | next [-] | | Anthropic also recently tweaked their usage limits to discourage use during peak hours. Why would they do that if inference was profitable? | | |
| ▲ | infecto a day ago | parent | next [-] | | Don't confuse inference (API usage) with the consumer plan products. When people say inference is profitable, they are referring to the cost to serve a token via the API. The consumer products are absolutely a question mark on profitability, and as we see with most of the business and enterprise plans, they are moving toward pure on-demand (API-priced) use full time. | |
| ▲ | strangegecko a day ago | parent | prev | next [-] | | Profitability doesn't imply infinite ability to scale. Of course they will want to prioritize their most profitable customers when they hit capacity issues. | |
| ▲ | aurareturn a day ago | parent | prev | next [-] | | They do it because their demand is higher than the compute that they have available to them. Their GPUs must be melting during peak hours, so they're encouraging people to move their workloads to off-peak hours if possible. This is the opposite of an AI bubble burst. | |
| ▲ | paulddraper a day ago | parent | prev | next [-] | | Those are subscription plans. They tweaked the limits/periods included in the subscription. Having higher limits for subscription plans didn't give them any more revenue. | |
| ▲ | financltravsty a day ago | parent | prev [-] | | Their infra team is very understaffed and they are reacting to the public backlash of "no 9s?" |
| |
| ▲ | nyeah a day ago | parent | prev [-] | | Can you give a few penciled numbers? | | |
| ▲ | paulddraper a day ago | parent [-] | | You can rent an H100 GPU for $4/hour. [1] That buys roughly 300k tokens of output in that hour. OpenAI charges about $6 for those tokens. Those are pessimistic assumptions. [1] https://lambda.ai/instances | |
| ▲ | hajile a day ago | parent | next [-] | | Can you keep that GPU 100% saturated at least 16 hours per day every day of the week? If not, you aren't breaking even. | | |
| ▲ | paulddraper a day ago | parent [-] | | Note this is also assuming you (1) Rent your GPUs. (2) Pay list price, no volume breaks. (3) Get only 85 tokens/sec. Realistically, frontier models would attain 200+ tokens/second amortized. Inference is extremely profitable at scale. | | |
| ▲ | aurareturn a day ago | parent [-] | | Assuming an 80GB H100 and you inference a MoE model close to the size of the 80GB VRAM, you're going to see around 10k tokens/second fully batched and saturated. An example here might be Mixtral 8x7B. You're generating about 36 million tokens/hour. Cost of Mixtral 8x7B on OpenRouter is $0.54/M input tokens and $0.54/M output tokens. At that rate, the output tokens alone return about $19.44/hour; if billed input tokens roughly match output volume, you're looking at closer to $38.88/hour on that H100 GPU. This is probably the best case scenario. In reality, inference providers will use multiple GPUs together to run bigger, smarter models for a higher price. |
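Spelling out that arithmetic (a sketch; the throughput, full saturation, and pricing figures are all best-case assumptions taken from the comment above):

```python
# Assumed best-case figures from the comment above
tokens_per_sec = 10_000          # fully batched, saturated 80GB H100
price_per_mtok = 0.54            # $/MTok, same for input and output here
gpu_rental_per_hour = 4.0        # rented H100, $/hour, per the figures upthread

output_tokens_per_hour = tokens_per_sec * 3600                  # 36,000,000
output_revenue = output_tokens_per_hour / 1e6 * price_per_mtok  # $19.44/hour

# If billed input tokens roughly match output volume, revenue roughly doubles
total_revenue = output_revenue * 2   # ~$38.88/hour against a $4/hour GPU
```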
|
| |
| ▲ | drakythe a day ago | parent | prev [-] | | 3.99 at 8x instances, with a minimum 2 week commitment. Good luck getting 70% usage average during that time. Useful when you're running a training round and can properly gauge demand, not so great when you're offering an API. | | |
| ▲ | infecto a day ago | parent [-] | | Is it not a good penciled number? It helps set the directional tone that inference cost is being covered. | |
| ▲ | drakythe a day ago | parent [-] | | It says the numbers are theoretically possible. Requiring 66% utilization to break even, when 100% utilization will piss off customers by invoking a queue, means it's a balancing act. "Technically correct. The best kind of correct". So inference may technically be _capable_ of being profitable, but I have questions about it being profitable in _practice_. |
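The break-even figure follows from the numbers upthread (a sketch; the $4/hour rental and ~$6/hour of fully-saturated token revenue are the assumed inputs):

```python
gpu_cost_per_hour = 4.0          # rented H100, per the figures upthread
revenue_at_full_saturation = 6.0 # $/hour of tokens billed at 100% utilization

# Fraction of full saturation needed just to cover the rental
break_even_utilization = gpu_cost_per_hour / revenue_at_full_saturation
# ~0.667, i.e. the "66%" utilization floor cited above
```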
|
|
|
|
|
|
| ▲ | a day ago | parent | prev [-] |
| [deleted] |