btbuildem 4 hours ago

> doesn't make financial sense to self-host

I guess that's debatable. I regularly run out of quota on my claude max subscription. When that happens, I can sort of kind of get by with my modest setup (2x RTX3090) and quantized Qwen3.

And this does not even account for privacy and availability. I'm in Canada, and as the US is slowly consumed by its spiral of self-destruction, I fully expect at some point a digital iron curtain will go up. I think it's prudent to have alternatives, especially with these paradigm-shattering tools.

jsheard 4 hours ago | parent | next [-]

I think AI may be the only place you could get away with calling a 2x350W GPU rig "modest".

That's like ten normal computers worth of power for the GPUs alone.

dymk 2 hours ago | parent | next [-]

That's maybe a few dollars to tens of dollars in electricity per month, depending on where in the US you live.
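
Rough back-of-envelope; the wattage, hours per day, and $/kWh below are assumptions rather than measured numbers:

    # Monthly electricity cost for a 2x RTX 3090 rig (all inputs are assumptions)
    def monthly_cost(avg_watts, hours_per_day, price_per_kwh):
        kwh = avg_watts / 1000 * hours_per_day * 30
        return kwh * price_per_kwh

    print(monthly_cost(700, 4, 0.12))  # ~$10/month at cheap US rates
    print(monthly_cost(700, 4, 0.30))  # ~$25/month at expensive ones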

bigyabai an hour ago | parent | prev | next [-]

> That's like ten normal computers worth of power for the GPUs alone.

Maybe if your "computer" in question is a smartphone? Remember that the M3 Ultra is a 300W+ chip that won't beat one of those 3090s in compute or raster efficiency.

jsheard an hour ago | parent [-]

I wouldn't class the M3 Ultra as a "normal" computer either. That's a big-ass workstation. I was thinking along the lines of a typical MacBook or Mac Mini or Windows laptop, which are fine for the 99% of people who aren't looking to run gigantic AI models locally.

bigyabai an hour ago | parent [-]

Those aren't "normal" computers, either. They're iPad chips running in the TDP envelope of a tablet, usually with iPad-level performance to match.

kataklasm 3 hours ago | parent | prev [-]

Did you even try to read and understand the parent comment? They said they regularly run out of quota on the exact subscription you're advising they subscribe to.

h3half 3 hours ago | parent [-]

Pot, kettle

wongarsu 4 hours ago | parent | prev | next [-]

Self-hosting training (or gaming) makes a lot of sense, and once you have the hardware, self-hosting inference on it is an easy step.

But if you have to factor in hardware costs, self-hosting doesn't seem attractive. Any model I can self-host I can also browse on OpenRouter and instantly find a provider offering great prices. With most of the cost being in the GPUs themselves, it just makes more sense to let others run them, with better batching and GPU utilization.

zozbot234 4 hours ago | parent [-]

If you can get near 100% utilization for your own GPUs (i.e. you're letting requests run overnight and not insisting on any kind of realtime response) it starts to make sense. OpenRouter doesn't have any kind of batched requests API that would let you leverage that possibility.
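
As a sketch of what "run overnight" could look like: queue prompts against a local OpenAI-compatible endpoint and collect the results in the morning. The URL, model name, and file names below are placeholders, not any provider's actual API.

    # Hypothetical overnight batch loop against a local OpenAI-compatible server
    import json, urllib.request

    def complete(prompt):
        req = urllib.request.Request(
            "http://localhost:8000/v1/chat/completions",  # placeholder local endpoint
            data=json.dumps({
                "model": "local-model",  # placeholder model name
                "messages": [{"role": "user", "content": prompt}],
            }).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]

    # Queue up work before bed, collect the answers in the morning
    with open("results.jsonl", "w") as out:
        for line in open("prompts.txt"):
            prompt = line.strip()
            out.write(json.dumps({"prompt": prompt, "answer": complete(prompt)}) + "\n")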

spmurrayzzz 3 hours ago | parent | next [-]

For inference, even with continuous batching, getting 100% MFU is basically impossible in practice. Even the frontier labs struggle with this in highly efficient InfiniBand clusters. It's slightly better with training workloads, just due to all the batching and parallel compute, but still mostly unattainable on consumer rigs (you spend a lot of time waiting for I/O).

I also don't think the 100% util is necessary either, to be fair. I get a lot of value out of my two rigs (2x RTX Pro 6000 and 4x 3090) even though it may not be 24/7 100% MFU. I'm always training, generating datasets, running agents, etc. I would never consider this a positive ROI measured against capex though; that's not really the point.

zozbot234 3 hours ago | parent [-]

Isn't this just saying that your GPU use is bottlenecked by things such as VRAM bandwidth and RAM-VRAM transfers? That's normal and expected.

sowbug 3 hours ago | parent | prev [-]

In Silicon Valley we pay PG&E close to 50 cents per kWh. An RTX 6000 PC uses about 1 kW at full load, and renting such a machine from vast.ai costs 60 cents/hour as of this morning. It's very hard for heavy-load local AI to make sense here.
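
Back-of-envelope with those same numbers (electricity only, ignoring the cost of the hardware itself):

    # Local electricity vs renting, using the figures above
    power_kw = 1.0      # RTX 6000 PC at full load
    pge_rate = 0.50     # $/kWh, Silicon Valley residential
    vast_rate = 0.60    # $/hour to rent a comparable machine (this morning's price)

    print(power_kw * pge_rate)  # $0.50/h in electricity alone
    print(vast_rate)            # $0.60/h to rent, hardware included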

btbuildem 3 hours ago | parent | next [-]

Yikes. I pay ~7¢ per kWh in Quebec. In the winter the inference rig doubles as a space heater for the office, so I don't feel bad about running locally, energy-wise.

Imustaskforhelp 3 hours ago | parent | prev [-]

And you're forgetting that something like vast.ai would STILL be more expensive than OpenRouter's API pricing, and even more so than the AI subscriptions, which actively LOSE money for the company.

So I would still point back to the GP (original comment): yes, it might not make financial sense to run these AI models locally [it makes sense when you want privacy etc., which are all fair concerns, just not financial ones].

But the fact that these models are open source still means they can be run locally if the dynamics shift in the future and running such large models at home starts to make sense. Even just having that possibility, plus the fact that multiple providers can now compete on OpenRouter and the like, definitely makes me appreciate GLM & Kimi compared to their proprietary counterparts.

Edit: I highly recommend this video: https://www.youtube.com/watch?v=SmYNK0kqaDI [AI subscription vs H100]

It's honestly one of the best I've watched on this topic.

HumanOstrich 3 hours ago | parent [-]

Why did you quote yourself at the end of this comment?

Imustaskforhelp 2 hours ago | parent [-]

Oops, sorry. Fixed it now. I'm trying an HN progressive extension that can quote any text I have selected, and I think that's what happened, or some such bug. I'm not sure.

It's fixed now :)

Aurornis 3 hours ago | parent | prev | next [-]

> I regularly run out of quota on my claude max subscription. When that happens, I can sort of kind of get by with my modest setup (2x RTX3090) and quantized Qwen3.

When talking about a fallback from Claude plans, the correct financial comparison would be the same model hosted on OpenRouter.

You could buy a lot of tokens for the price of a pair of 3090s and a machine to run them.

bigyabai an hour ago | parent [-]

> You could buy a lot of tokens for the price of a pair of 3090s and a machine to run them.

That's a subjective opinion, to which the answer is "no you can't" for many people.

mythz 4 hours ago | parent | prev | next [-]

Did the napkin math on M3 Ultra ROI when DeepSeek V3 launched: at $0.70/2M tokens and 30 tps, a $10K M3 Ultra would take ~30 years of non-stop inference to break even - without even factoring in electricity. Clearly people aren't self-hosting to save money.

I've got a lite GLM sub $72/yr which would require 138 years to burn through the $10K M3 Ultra sticker price. Even GLM's highest cost Max tier (20x lite) at $720/yr would buy you ~14 years.
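
Sketch of that napkin math, taking the quoted price, throughput, and sticker prices at face value:

    # Break-even years for a $10K M3 Ultra vs API / subscription pricing
    hw_cost = 10_000                    # M3 Ultra sticker price
    tps = 30                            # tokens/second
    price_per_token = 0.70 / 2_000_000  # $0.70 per 2M tokens

    tokens_per_year = tps * 3600 * 24 * 365
    api_cost_per_year = tokens_per_year * price_per_token

    print(hw_cost / api_cost_per_year)  # ~30 years of non-stop inference
    print(hw_cost / 72)                 # ~139 years vs the $72/yr lite sub
    print(hw_cost / 720)                # ~14 years vs the $720/yr Max tier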

ljosifov 3 hours ago | parent | next [-]

Everyone should do the calculation for themselves. I too pay for a couple of subs. But I'm noticing that having an agent work for me 24/7 changes the calculation somewhat. Often not taken into account: the price of input tokens. To produce 1K of code for me, the agent may need to churn through 1M tokens of codebase. IDK if that will be cached by the API provider or not, but that makes a 5-7x price difference. Decent discussion today about that and more: https://x.com/alexocheema/status/2020626466522685499
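
Rough illustration of why input tokens dominate; the token counts, prices, and cache discount below are made up for the example:

    # Hypothetical cost of producing 1K tokens of code from a 1M-token codebase
    in_tokens, out_tokens = 1_000_000, 1_000
    in_price, out_price = 0.60e-6, 2.20e-6  # $/token, uncached input vs output (assumed)
    cached_in_price = in_price / 6          # assumed discount on cache hits

    uncached = in_tokens * in_price + out_tokens * out_price
    cached = in_tokens * cached_in_price + out_tokens * out_price
    print(uncached / cached)  # ~6x, in the 5-7x range mentioned above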

wongarsu 4 hours ago | parent | prev | next [-]

And it's worth noting that you can get DeepSeek at those prices from DeepSeek (Chinese), DeepInfra (US with Bulgarian founder), NovitaAI (US), AtlasCloud (US with Chinese founder), ParaSail (US), etc. There is no shortage of companies offering inference, with varying levels of trustworthiness, certifications, and promises around (lack of) data retention. You just have to pick one you trust.

oceanplexian 3 hours ago | parent | prev | next [-]

Doing inference with a Mac Mini to save money is more or less holding it wrong. Of course if you buy some overpriced Apple hardware it’s going to take years to break even.

Buy a couple of real GPUs and do tensor parallelism and concurrent batch requests with vLLM, and it becomes extremely cost-competitive to run your own hardware.
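
Something like this with vLLM's Python API, as a sketch (the model name is a placeholder, and exact arguments may differ by version):

    # Batched inference split across 2 GPUs with vLLM
    from vllm import LLM, SamplingParams

    llm = LLM(model="some-open-model", tensor_parallel_size=2)  # shard weights over both GPUs
    params = SamplingParams(max_tokens=512, temperature=0.7)

    # Submitting many prompts at once lets vLLM batch them and keep the GPUs busy
    prompts = [f"Summarize document {i}" for i in range(64)]
    outputs = llm.generate(prompts, params)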

mythz 3 hours ago | parent [-]

> Doing inference with a Mac Mini to save money is more or less holding it wrong.

No one's running these large models on a Mac Mini.

> Of course if you buy some overpriced Apple hardware it’s going to take years to break even.

Great, where can I find cheaper hardware that can run GLM 5's 745B or Kimi K2.5's 1T models? Currently it takes 2x M3 Ultras (1TB VRAM) to run Kimi K2.5 at 24 tok/s [1]. What are the better-value alternatives?

[1] https://x.com/alexocheema/status/2016404573917683754

DeathArrow 3 hours ago | parent | prev | next [-]

I don't think an Apple PC can run the full DeepSeek or GLM models.

Even if you quantize the hell out of the models to fit them in memory, they will be very slow.

retr0rocket 4 hours ago | parent | prev [-]

[dead]

visarga 3 hours ago | parent | prev | next [-]

Your $5,000 PC with 2 GPUs could have bought you 2 years of Claude Max, giving access to a much more powerful model with longer context. In 2 years you could make that investment back in pay raise.
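
The arithmetic, assuming the $200/month Max tier (the plan price here is an assumption):

    # $5,000 of hardware vs a $200/month subscription
    print(5_000 / 200)  # ~25 months, i.e. roughly 2 years of Claude Max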

benterix 2 hours ago | parent | next [-]

> In 2 years you could make that investment back in pay raise.

Could you elaborate? I fail to grasp the implication here.

tw1984 2 hours ago | parent | prev | next [-]

> In 2 years you could make that investment back in pay raise.

You can't be a happy Uber driver making more money over the next 24 months because your fancy car has the best FSD in town, when every car in your town has the same FSD.

visarga 2 hours ago | parent [-]

But they don't have the same human in the loop.

tw1984 an hour ago | parent [-]

That software is called autonomous agents. The term "autonomous" has nothing to do with a human in the loop; it's the complete opposite.

dymk an hour ago | parent | prev [-]

This claim has so many assumptions mixed in that it's utterly useless.

7thpower 4 hours ago | parent | prev | next [-]

Unless you already had those cards, it probably still doesn't make sense from a purely financial perspective, at least not without counting the other benefits you're getting out of it.

Doesn’t mean you shouldn’t do it though.

flaviolivolsi 4 hours ago | parent | prev | next [-]

How does your quantized Qwen3 compare in code quality to Opus?

Aurornis 4 hours ago | parent | next [-]

Not the person you’re responding to, but my experience with models up through Qwen3-coder-next is that they’re not even close.

They can do a lot of simple tasks in common frameworks well. Doing anything beyond basic work will just burn tokens for hours while you review and reject code.

btbuildem 3 hours ago | parent | prev [-]

It's just as fast, but not nearly as clever. I can push the context size to 120k locally, but the quality of the work it delivers starts to falter above, say, 40k. Generally you have to feed it more bite-sized pieces and keep one chat to one topic. It's definitely a step down from SOTA.

4 hours ago | parent | prev [-]
[deleted]