himata4113 6 hours ago

I was looking into self-hosting Deepseek V4 Pro, since frankly cache reads are an absolute scam and they're 90% of my cost. But then I looked at the ROI, and it will never pay off fast enough: the hardware would become obsolete first, even if you were running 10 token generation streams 24/7.

The napkin math came out to renting being around 27 times cheaper than owning (not even including power). I think we're really screwed when it comes to owned access to AI unless Intel comes out swinging with a C-series card that has 128GB of VRAM, so we can run these models in a 4x128GB configuration. But that seems unlikely since Nvidia has a large stake in them.

This was calculated expecting around 30 tok/s. Of course you can get 2-5 tok/s much, much cheaper, but that's unusable for my workflow.
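The rent-vs-own comparison above can be sketched as a few lines of arithmetic. Every number below (hardware price, lifetime, rented API rate) is an illustrative assumption, not a figure from this thread; the point is only the shape of the calculation:

```python
# Rent-vs-own napkin math. All inputs are hypothetical placeholders.
HARDWARE_COST = 40_000        # USD for a rig big enough to hold the model (assumed)
USEFUL_LIFE_YEARS = 3         # assume obsolescence after ~3 years
HOURS = USEFUL_LIFE_YEARS * 365 * 24

TOK_PER_S = 30                # target throughput from the comment
STREAMS = 10                  # "10 token generation streams 24/7"
tokens_owned = TOK_PER_S * STREAMS * 3600 * HOURS   # lifetime token output

owned_cost_per_mtok = HARDWARE_COST / (tokens_owned / 1e6)

RENTED_PER_MTOK = 0.03        # hypothetical blended API price, USD / M tok
print(f"owned:  ${owned_cost_per_mtok:.4f} / M tok (power excluded)")
print(f"rented: ${RENTED_PER_MTOK:.4f} / M tok")
print(f"ratio:  {owned_cost_per_mtok / RENTED_PER_MTOK:.0f}x")
```

With these placeholder inputs the ratio lands in the same tens-of-times range the commenter describes; plug in real quotes to reproduce their 27x.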

kingstnap 5 hours ago | parent | next [-]

Ironically, the one provider not scamming you on cache reads is Deepseek itself.

Everyone else charges a ridiculous amount, but Deepseek's API is $0.003625 / M tok.

I'm surprised no one talks about this, given how significant it is. GPT 5.5, for example, costs a ridiculous $0.50 / M tok cached. Deepseek is literally almost 140 times cheaper, which matters a lot for tool calls.

himata4113 5 hours ago | parent [-]

It's a temporary promo; Deepseek will go back to being only ~10x cheaper afterwards.

kingstnap 4 hours ago | parent [-]

Yes, Deepseek V4 Pro is currently on discount.

> The deepseek-v4-pro model is currently offered at a 75% discount, extended until 2026/05/31 15:59 UTC.

However, even when the discount ends it's still very cheap. It will go back to $0.0145 / M tok on cache hits. That's still ~34x cheaper than GPT 5.5.
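The ratios quoted in this subthread check out with the listed prices; a quick sanity check, using only the numbers already given above:

```python
# Verify the price ratios quoted in the thread.
gpt_cached = 0.50        # USD / M tok cached, as quoted for GPT 5.5
ds_promo = 0.003625      # Deepseek cache-hit price during the promo
ds_regular = 0.0145      # Deepseek cache-hit price after the promo

print(f"promo:   {gpt_cached / ds_promo:.0f}x cheaper")    # ~138x, i.e. "almost 140x"
print(f"regular: {gpt_cached / ds_regular:.1f}x cheaper")  # ~34.5x
print(f"promo is {1 - ds_promo / ds_regular:.0%} off")     # 75%, matching the notice
```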

himata4113 2 hours ago | parent [-]

Doesn't matter when subscriptions effectively give you cache reads for free. It's only really worth it if it's 340x cheaper; otherwise I'd be paying $120 a day, with 90% of the cost being cache reads, for any top-level open-source model.

dist-epoch 3 hours ago | parent | prev | next [-]

The only way to profitably serve AI is with large batch sizes - running 500 requests at the same time.

If you serve a single user you'll never make back even your electricity costs, never mind the hardware.
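The batching argument can be sketched numerically. This is an idealization: it assumes decode is memory-bandwidth bound, so per-stream speed stays roughly flat as the batch grows and the fixed power draw amortizes over more tokens. The node size, power draw, and electricity price are assumptions, not figures from the thread:

```python
# Electricity cost per million tokens at different batch sizes.
# All inputs are hypothetical.
GPU_POWER_KW = 8 * 0.7     # assumed 8-GPU node at ~700 W per card
PRICE_PER_KWH = 0.15       # assumed USD per kWh

def cost_per_mtok(batch_size, tok_per_s_per_stream=30):
    """Electricity cost (USD) per million tokens, idealized:
    per-stream throughput is held constant across batch sizes."""
    tokens_per_hour = batch_size * tok_per_s_per_stream * 3600
    usd_per_hour = GPU_POWER_KW * PRICE_PER_KWH
    return usd_per_hour / (tokens_per_hour / 1e6)

for b in (1, 50, 500):
    print(f"batch {b:3d}: ${cost_per_mtok(b):.4f} / M tok in electricity")
```

At batch size 1 the electricity alone can exceed what APIs charge for the whole service, which is the commenter's point.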

varispeed 4 hours ago | parent | prev [-]

Would you mind sharing the napkin maths?

mordae 3 hours ago | parent [-]

Not OP, but decode is memory-bandwidth bound, so basically take your memory bandwidth in GiB/s and divide by 30 (the target tok/s) to see how many GiB of weights you can read per token. You need at least 128 GiB just to hold the model, too. It's expensive to get 200 GiB/s, very expensive to get 400 GiB/s, and above that you're looking at DC-grade GPUs. Multiple, in fact.