websiteapi a day ago

I get tempted to buy a couple of these, but I just feel like the amortization doesn’t make sense yet. Surely in the next few years this will be orders of magnitude cheaper.

NitpickLawyer a day ago | parent | next [-]

Before committing to purchasing two of these, look at the true speeds, which few people post, not just "it works". We're at a point where we can run these very large models "at home", and that's great! But real usage now involves very large contexts, both for prompt processing and for token generation. Whatever speeds these models get at "0" context are very different from what they get at "useful" context, especially for coding and such.
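A quick way to check this on your own box (a minimal sketch, not a proper benchmark: it assumes a local OpenAI-compatible endpoint such as llama.cpp's llama-server on localhost:8080, and it approximates token counts by counting stream chunks) is to time the first token and the rest separately as the prompt grows:

    # Sketch: measure prefill (time to first token) and decode speed at growing
    # context sizes against a local OpenAI-compatible server. The endpoint and
    # model name are assumptions; adjust to whatever your server exposes.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
    filler = "lorem ipsum " * 64  # roughly 128 tokens per repetition

    for reps in (1, 32, 128):  # roughly "0", medium, and large context
        prompt = filler * reps + "\nSummarize the text above in one sentence."
        start = time.time()
        first_token_at = None
        n_out = 0
        stream = client.chat.completions.create(
            model="local-model",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            max_tokens=128,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_at is None:
                    first_token_at = time.time()
                n_out += 1
        decode_s = max(time.time() - first_token_at, 1e-6)
        print(f"~{reps * 128:>6} ctx tokens | prefill {first_token_at - start:5.1f}s"
              f" | decode {n_out / decode_s:5.1f} tok/s")

Prefill time in particular tends to blow up at coding-agent context sizes, which is exactly the number the "it works" posts leave out.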

solarkraft a day ago | parent | next [-]

Are there benchmarks that effectively measure this? This is essential information when speccing out an inference system/model size/quantization type.

cubefox a day ago | parent | prev [-]

DeepSeek-v3.2 should be better for long context because it uses (near-linear) sparse attention.

stingraycharles a day ago | parent | prev | next [-]

I don’t think it will ever make sense; you can buy so much cloud based usage for this type of price.

From my perspective, the biggest problem is that I am just not going to be using it 24/7. Which means I’m not getting nearly as much value out of it as the cloud based vendors do from their hardware.

Last but not least, if I want to run queries against open source models, I prefer to use a provider like Groq or Cerebras as it’s extremely convenient to have the query results nearly instantly.

websiteapi a day ago | parent | next [-]

My issue is that once it's in your workflow, you'd be pretty latency sensitive. Imagine those record-it-all apps working well: eventually you'd become pretty reliant on it. I don't necessarily want to be at the whims of the cloud.

stingraycharles a day ago | parent [-]

Aren’t those “record it all” applications implemented as RAG, with snippets injected into the context based on embedding similarity?

Obviously you’re not going to always inject everything into the context window.
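Roughly like this, I'd assume (a minimal sketch; the embedding model name and the in-memory store are illustrative, a real app would use a vector database):

    # Sketch of the RAG loop: embed recorded snippets, retrieve by cosine
    # similarity, and inject only the top matches into the LLM prompt.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    snippets = [
        "Meeting with Alice about the Q3 roadmap.",
        "Reminder: dentist appointment on Friday.",
        "Discussed migrating the billing service to Postgres.",
    ]
    snippet_vecs = embedder.encode(snippets, normalize_embeddings=True)

    def retrieve(query, k=2):
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = snippet_vecs @ q              # cosine similarity (vectors are normalized)
        top = np.argsort(scores)[::-1][:k]
        return [snippets[i] for i in top]

    query = "What did we decide about the database?"
    context = "\n".join(retrieve(query))
    print(f"Context:\n{context}\n\nQuestion: {query}")  # what actually hits the LLM

So only a few retrieved snippets go to the model per query, not the whole recording.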

a day ago | parent | prev | next [-]
[deleted]
lordswork a day ago | parent | prev | next [-]

As long as you're willing to wait up to an hour for your GPU to get scheduled when you do want to use it.

stingraycharles a day ago | parent [-]

I don’t understand what you’re saying. What’s preventing you from using, e.g., OpenRouter to run a query against Kimi-K2 from whatever provider?
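For reference, the OpenRouter path is just the OpenAI-compatible API (a minimal sketch; the model slug is an assumption, check openrouter.ai for the current name, and routing and pricing are up to them):

    # Sketch: query Kimi K2 through OpenRouter's OpenAI-compatible endpoint.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="moonshotai/kimi-k2",  # assumed slug; OpenRouter picks the backend provider
        messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
    )
    print(resp.choices[0].message.content)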

hu3 a day ago | parent | next [-]

and you'll get a faster model this way

bgwalter a day ago | parent | prev [-]

Because you have Cloudflare (MITM 1), OpenRouter (MITM 2), and finally the "AI" provider, all of whom can read, store, analyze, and resell your queries.

EDIT: Thanks for downvoting what is literally one of the most important reasons for people to use local models. Denying and censoring reality does not prevent the bubble from bursting.

irthomasthomas a day ago | parent [-]

You can use chutes.ai's TEE (Trusted Execution Environment), and Kimi K2 is running at about 100 t/s right now.

givinguflac a day ago | parent | prev [-]

I think you’re missing the whole point, which is not using cloud compute.

stingraycharles a day ago | parent [-]

Because of privacy reasons? Yeah, I’m not going to spend a small fortune on that just to be able to use these types of models.

givinguflac a day ago | parent [-]

There are plenty of examples and reasons to do so besides privacy: because one can, because it’s cool, for research, for fine-tuning, etc. I never mentioned privacy. Your use case is not everyone’s.

wyre a day ago | parent [-]

You can still do all of those things renting AI server compute, though? I think privacy and the cool factor are the only real reasons it would be rational for someone to spend (checks the Apple Store) $19,000 on computer hardware...

givinguflac 8 hours ago | parent [-]

Why do you look at this as a consumer? Have you never heard of businesses spending money on hardware???

chrsw a day ago | parent | prev | next [-]

The only reason to run local models is privacy, never cost or latency.

websiteapi a day ago | parent | next [-]

Indeed - my main use case is those kinds of "record everything" setups. I'm not even super privacy-conscious per se, but it just feels too weird to send literally everything I'm saying all of the time to the cloud.

Luckily, for now Whisper doesn't require too much compute, but the kind of interesting analysis I'd want would require at least a 1B parameter model, maybe 100B or 1T.
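(For what it's worth, the local transcription half is already just a few lines with the openai-whisper package; this is a minimal sketch, the audio path is made up, and it's the downstream analysis model that would need the big hardware:)

    # Sketch: local speech-to-text with the openai-whisper package.
    import whisper

    model = whisper.load_model("base")                       # small enough for a laptop
    result = model.transcribe("recordings/2024-01-15.wav")   # hypothetical file
    print(result["text"])                                    # transcript to feed the analysis model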

nottorp a day ago | parent [-]

> it just feels too weird to send literally everything I'm saying all of the time to the cloud

... or your clients' codebases ...

andy99 a day ago | parent | prev [-]

Autonomy generally, not just privacy. You never know what the future will bring; AI will be enshittified, and so will hubs like huggingface. It’s useful to have an off-grid solution that isn’t subject to VCs wanting to see their capital returned.

Aurornis a day ago | parent | next [-]

> You never know what the future will bring; AI will be enshittified, and so will hubs like huggingface.

If anyone wants to bet that future cloud hosted AI models will get worse than they are now, I will take the opposite side of that bet.

> It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.

You can pay cloud providers for access to the same models that you can run locally, though. You don’t need a local setup even for this unlikely future scenario where all of the mainstream LLM providers simultaneously decide to make their LLMs poor quality and none of them sees this as a market opportunity to provide good service.

But even if we ignore all of that and assume that all of the cloud inference everywhere becomes bad at the same time at some point in the future, you would still be better off buying your own inference hardware at that point in time. Spending the money to buy two M3 Ultras right now to prepare for an unlikely future event is illogical.

The only reason to run local LLMs is if you have privacy requirements or you want to do it as a hobby.

CamperBob2 a day ago | parent [-]

> If anyone wants to bet that future cloud hosted AI models will get worse than they are now, I will take the opposite side of that bet.

OK. How do we set up this wager?

I'm not knowledgeable about online gambling or prediction markets, but further enshittification seems like the world's safest bet.

Aurornis a day ago | parent [-]

> but further enshittification seems like the world's safest bet.

Are you really, actually willing to bet that today's hosted LLM performance per dollar is the peak? That it's all going to be worse at some arbitrary date (necessary condition for establishing a bet) in the future?

It would need to be evaluated by a standard benchmark, agreed upon ahead of time. No loopholes or vague verbiage allowing something to be claimed as "enshittification" or other fuzzy terms.

CamperBob2 a day ago | parent [-]

Sorry, didn't realize what you were actually referring to. Certainly I'd assume the models will keep getting better from the standpoint of reasoning performance. But much of that improved performance will be used to fool us into buying whatever the sponsor is selling.

That part will get worse, given that it hasn't really even begun ramping up yet. We are still in the "$1 Uber ride" stage, where it all seems like a never-ending free lunch.

chrsw a day ago | parent | prev [-]

Yes, I agree. And you can add security to that too.

alwillis a day ago | parent | prev | next [-]

Hopefully the next time it’s updated, it will ship with some variant of the M5.

amelius a day ago | parent | prev | next [-]

Maybe wait until RAM prices have normalized again.

segmondy a day ago | parent | prev [-]

This is a weird line of thinking. Here's a question: if you buy one of these and figure out how to use it to make $100k in 3 months, would that be good? When you run a local model, you shouldn't compare it to the cost of using an API. The value lies in how you use it.

Let's forget about making money. Let's just say you have a weird fetish and like to have dirty sexy conversations with your LLM. How much would you pay for your data not to be leaked and for the world not to see your chats? Perhaps having your own private LLM makes it all worth it. If you have nothing special going on, then by all means use APIs, but if you feel/know your input is special, then yeah, go private.