paulddraper a day ago

Anthropic has said inference is profitable. That’s a biased source, but the math pencils.

This is why switching to local open weight models saves a lot of money. (Even though it’s not apples to apples.)

drakythe a day ago | parent | next [-]

Anthropic also recently tweaked their usage limits to discourage use during peak hours. Why would they do that if inference was profitable?

infecto a day ago | parent | next [-]

Don’t confuse inference (API usage) with the consumer subscription products. When people say inference is profitable, they mean the cost to serve a token via the API. The consumer products are absolutely a question mark on profitability, and as we see with most business and enterprise plans, they are giving way to pure on-demand (API-priced) use full time.

strangegecko a day ago | parent | prev | next [-]

Profitability doesn't imply infinite ability to scale. Of course they will want to prioritize their most profitable customers when they hit capacity issues.

aurareturn a day ago | parent | prev | next [-]

They do it because their demand is higher than the compute they have available. Their GPUs must be melting during peak hours, so they're encouraging people to move their workloads to off-peak hours where possible.

This is the opposite of an AI bubble burst.

paulddraper a day ago | parent | prev | next [-]

Those are subscription plans. They tweaked the limits/periods included in the subscription. Having higher limits for subscription plans didn't give them any more revenue.

financltravsty a day ago | parent | prev [-]

Their infra team is very understaffed and they are reacting to the public backlash of "no 9s?"

nyeah a day ago | parent | prev [-]

Can you give a few penciled numbers?

paulddraper a day ago | parent [-]

You can rent an H100 GPU for $4/hour. [1]

Call it 300k tokens generated in that hour (~85 tokens/second).

OpenAI charges roughly $6 for that many tokens.

Those are pessimistic assumptions.

[1] https://lambda.ai/instances
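The pencil math above, as a quick sanity check. All figures are the thread's assumptions (rented H100 at list price, a pessimistic 85 tokens/second, and a ~$20/M-token price implied by "$6 for ~300k tokens"), not measured values:

```python
# Back-of-envelope check of the thread's numbers.
gpu_cost_per_hour = 4.00                     # rented H100, list price
tokens_per_second = 85                       # pessimistic assumption
tokens_per_hour = tokens_per_second * 3600   # ~306k tokens

price_per_million = 20.00                    # implied by "$6 for ~300k tokens"
revenue_per_hour = tokens_per_hour / 1e6 * price_per_million

print(f"tokens/hour:  {tokens_per_hour:,}")
print(f"revenue/hour: ${revenue_per_hour:.2f}")
print(f"margin/hour:  ${revenue_per_hour - gpu_cost_per_hour:.2f}")
```

Even under these pessimistic assumptions the hourly revenue ($6.12) edges out the hourly rental cost ($4.00).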

hajile a day ago | parent | next [-]

Can you keep that GPU 100% saturated at least 16 hours per day every day of the week?

If not, you aren't breaking even.

paulddraper a day ago | parent [-]

Note this is also assuming you

(1) Rent your GPUs.

(2) Pay list price, no volume breaks.

(3) Get only 85 tokens/sec. Realistically, frontier models would attain 200+ tokens/second amortized.

Inference is extremely profitable at scale.
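Rerunning the same sketch at the "realistic" 200+ tokens/second amortized rate, and keeping the same implied ~$20/M-token price (an assumption carried over from the earlier comment, not a quoted price):

```python
# Same pencil math at the claimed realistic throughput.
gpu_cost_per_hour = 4.00                     # still rented, still list price
tokens_per_second = 200                      # "200+ tokens/second amortized"
price_per_million = 20.00                    # assumed, as above

revenue_per_hour = tokens_per_second * 3600 / 1e6 * price_per_million
margin_per_hour = revenue_per_hour - gpu_cost_per_hour

print(f"revenue/hour: ${revenue_per_hour:.2f}")
print(f"margin/hour:  ${margin_per_hour:.2f}")
```

At 200 tokens/second the implied revenue is $14.40/hour against a $4.00/hour rental cost, which is the basis for the "extremely profitable at scale" claim.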

aurareturn a day ago | parent [-]

Assuming an 80GB H100 running an MoE model close to the size of the 80GB of VRAM, you're going to see around 10k tokens/second fully batched and saturated. An example here might be Mixtral 8x7B.

That's about 36 million tokens/hour. Mixtral 8x7B on OpenRouter costs $0.54/M input tokens and $0.54/M output tokens.

You're looking at potentially $38.88/hour of return on that H100 GPU. This is probably the best-case scenario.

In reality, inference providers will use multiple GPUs together to run bigger, smarter models for a higher price.
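For the arithmetic behind the $38.88 figure: 10k tokens/second is 36M tokens/hour, which at $0.54/M output tokens is $19.44/hour. One reading that reproduces $38.88 also bills an equal volume of input tokens at the same $0.54/M rate; that doubling is an assumption, not something the comment states outright:

```python
# Reconstructing the comment's $38.88/hour figure.
tokens_per_second = 10_000                       # fully batched, saturated H100
tokens_per_hour = tokens_per_second * 3600       # 36,000,000
price_per_million = 0.54                         # OpenRouter rate for Mixtral 8x7B

output_revenue = tokens_per_hour / 1e6 * price_per_million   # $19.44
# Assumption: an equal volume of input tokens billed at the same rate.
total_revenue = output_revenue * 2                           # $38.88

print(f"output only:    ${output_revenue:.2f}/hour")
print(f"input + output: ${total_revenue:.2f}/hour")
```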

drakythe a day ago | parent | prev [-]

It's $3.99/GPU-hour only for 8x instances, with a minimum two-week commitment. Good luck averaging 70% utilization over that window. Useful when you're running a training round and can properly gauge demand; not so great when you're offering an API.

infecto a day ago | parent [-]

Is it not a good penciled number? It helps set the directional tone that inference cost is being covered.

drakythe a day ago | parent [-]

It says the numbers are theoretically possible. Requiring ~66% utilization to break even, when running at 100% utilization will piss off customers by forcing a queue, means it's a balancing act.

“Technically correct. The best kind of correct.” So inference may technically be _capable_ of being profitable, but I have questions about it being profitable in _practice_.
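The ~66% break-even utilization follows directly from the earlier pencil numbers ($4/hour rental cost against ~$6/hour revenue at full utilization, both assumptions from upthread):

```python
# Break-even utilization under the thread's assumed numbers.
cost_per_hour = 4.00                  # rented H100, list price
revenue_at_full_utilization = 6.00    # ~$6/hour when fully saturated

breakeven_utilization = cost_per_hour / revenue_at_full_utilization
print(f"break-even utilization: {breakeven_utilization:.1%}")   # ~66.7%
```

Anything below that utilization loses money on the rented GPU; anything approaching 100% implies queuing, which is the balancing act described above.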