Remix.run Logo
deng 6 hours ago

I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~3$ per 1M/tokens for that model on Openrouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...

redfloatplane 6 hours ago | parent | next [-]

> I pay ~3$ per 1M/tokens for that model on Openrouter

I think the thing is, there's an unspoken "for now" at the end of that sentence and people running this locally are hedging against that "for now". Some people prefer to feel that they own the means rather than rent the means, even if the one they own is worse than the one they can rent. Especially with today's Fable news and the harsh realisation that the "for now" is dependent on very many unpredictable factors, where the one you have locally costs you capital today and a relatively predictable run-rate (made more predictable with on-prem solar for example), but should otherwise work predictably forever.

I'm not saying that you're wrong to do what you're doing, just that many people have their own lines in the sand where renting vs buying makes sense, and it doesn't only boil down to a rational (or irrational) financial decision.

jubilanti 5 hours ago | parent [-]

You're treating open weight inference providers the same as proprietary ones. They're fundamentally different business models. Proprietary companies have an incentive to subsidize actual inference and training costs in order to gain market share. The few dozen or so companies selling Qwen models by the token on openrouter are in a commodities market.

If suddenly the CCP declared a total digital embargo on Alibaba's Qwen models or even if for some reason all of mainland China (and Singapore) was completely unreachable from the rest of the world, the dozen or so companies selling Qwen by the token elsewhere in the world could continue business as usual.

bee_rider 3 hours ago | parent | next [-]

I don’t know anything about the open weight host business model. Do we know for certain that the folks selling inference by the token are really selling them in an upfront and profitable way? No subsidies from harvesting the info, to sell to the model trainers or anything like that?

redfloatplane 5 hours ago | parent | prev [-]

I was thinking of user-side regulations as well, not only provider-side ones. I could imagine a world where a government rules that you may not use LLMs for anything, which would be much easier to get around if you have local means.

ThunderSizzle 5 hours ago | parent | prev | next [-]

An R9700 is $1350 and can get 100 TPS running Qwen3.6-35B-A3B Q5 with 130k context window (with room to spare) with a bit of fine tuning llamacpp-vulkan, but llamacpp's repository instability and lack of real versioning frustrates me.

In terms of electricity, if you aren't using it, even with all the vram loaded, at most your wasting about 30 watts or so.

Prompt processing a large uncached context is annoying, which is why I forced a lower context window, but I don't know if it's any worse in performance than the cloud models I've used.

There's a niceness, to me, knowing I don't have to rent it anymore. If you rent it, the terms can change regularly.

rsync 2 hours ago | parent | next [-]

"An R9700 is $1350 and can get 100 TPS running Qwen3.6-35B-A3B Q5 with 130k context window ..."

How would that change (improve) if you had two R9700 in a similar configuration ?

vardalab an hour ago | parent [-]

better prompt processing like 1.5x+ and more kv but tg most likely lower like 0.8x or so but I am just going by memory for Qwen3.5 without mtp.

bertili 5 hours ago | parent | prev [-]

Qwen 27b is a compute heavy dense model.

alexjplant 3 hours ago | parent | prev | next [-]

I've spent the past week trying to scheme a way to get affordable local inference of something useful (Qwen3.6-36B-A3B) for ~$500 and have come to the conclusion that it simply isn't viable. A pair of power-restricted P100s in a workstation gets close but the workstations themselves are expensive and rare as hen's teeth (not to mention loud and large). I think early '27 will be when things open up as the hardware market unclenches and further strides are made in small capable models.

medfield 5 hours ago | parent | prev | next [-]

I use local models to explore, hosted models to refine. I somewhat envy those who can sustain local models (q8 120b+) running as a hobby.... for me, the practical path is a better SearXNG setup and knowing my routes forward.

PeterStuer 5 hours ago | parent | prev | next [-]

When they declare open models a 'security risk', his setup will be running, yours will not and even that 3090 will be way outside of your reach.

alexhans 3 hours ago | parent | prev | next [-]

I think it's important to be able to do both so you can stay in control of the price to value created relationship.

In last year, some people were publishing aider /ollama/open router [1] and now thankfully people are publishing all around about pi/qwen/llama.cpp/openrouter. It's widespread.

[1] https://alexhans.github.io/posts/aider-with-open-router.html

TSiege 6 hours ago | parent | prev | next [-]

It’s a personal hobby project why should we care this is how someone chooses to spend their free time and money? Lots of hobbies are expensive and pointless if you think of commercially available offerings. That’s why it’s a hobby and not a small business

pier25 2 hours ago | parent | prev | next [-]

> not to mention the electricity to run them...

And noise.

amelius 4 hours ago | parent | prev | next [-]

You are paying with your privacy ...

toyg 5 hours ago | parent | prev | next [-]

Yeah but they can also be used to play games and do other stuff.

NicoJuicy 5 hours ago | parent | prev | next [-]

Rtx 3090 24 gb set me back 390€ a year ago ( 2nd hand)

rirze 5 hours ago | parent [-]

Was it still in good condition? That price makes me wonder if it was used for crypto mining, which can wear down the hardware.

gsora 5 hours ago | parent [-]

Any sane crypto miner undervolted and underclocked their GPUs for efficiency's sake; if anything, they went through less wear than, say, regular gaming.

Der_Einzige 5 hours ago | parent | prev | next [-]

Openrouter doesn't give you access to the models internals, i.e. complete control of logprobs, sampler stack, any PeFTs.

Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI and accept that the cost you'll pay for tokens is higher than you will when consumed via any cloud. That's the price for privacy, control, and better quality via inference time optimizations that otherwise aren't available.

jubilanti 5 hours ago | parent [-]

> Openrouter doesn't give you access to the models internals, i.e. complete control of logprobs, sampler stack, any PeFTs.

Openrouter gives you access to whatever the inference provider gives. They're just the middleman. Many providers give logprobs if you ask, it's in their API. And yeah, no Peft or Lora, but that's an entirely different product. And some of the inference providers do that directly.

> Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI

But the whole point of openrouter is that you can run models by the token and you don't have to care about local AI? Sounds like you're more upset that people aren't making the same calculation on privacy and local control vs cost and ease of use.

flowbarai 2 hours ago | parent | prev [-]

[flagged]