Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.

▲

wongarsu 7 hours ago | parent | next [-]

I know of multiple businesses in Europe that have been doing that for a while with 70B models, and are upgrading hardware to run the new crop of 700B-1T models (really started around Kimi K2, but buying and hosting that kind of hardware takes time)

Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic

▲

user43928 4 hours ago | parent | next [-]

While certainly there are such cases with trade secrets, it's worth noting that even large banks typically have a provider like Azure or AWS onboarded.

There they can deploy these models while using the existing legal frameworks.

▲

CubsFan1060 6 hours ago | parent | prev [-]

What kind of hardware/price does it take to run those?

▲

bitmasher9 6 hours ago | parent | next [-]

Nvidia will sell you an entire server rack ready for inference. Or maybe you can roll your own Blackwell-based system.

We’re approaching a world where running a premier frontier model is possible on a workstation. Nvidia’s next generation will probably include something under $30k that looks like a desktop. It sounds expensive, until you look at your Anthropic bill.

For the open models, the unit economics are similar to cloud computing. You can save a ton on expenses by buying the hardware, but it requires a lot of in-house expertise, and you get the most value if you keep the system operating around the clock. The big kink is that open models are usually 2 quarters behind frontier, and your competitors are probably trying to get access to Mythos.

▲

program_whiz 4 hours ago | parent [-]

"approaching" is doing some work there. $30K today will get you 90-144GB usable VRAM with solid system RAM and disk and CPU. A single B200 chip at 180GB is $40K. Unfortunately that is nowhere close to being able to run a 750B param model. For something like that, we're getting closer to 1TB VRAM (8+ H200/B200), and then 1M context KV cache is many more GBs on top of that.

That's a $500K-$1M+ rig as of now. That's a lot of $200 subscriptions to break even, but reasonable if you are paying Anthropic $25/M tokens. Then of course there's the power, cooling, and maintenance to consider...
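To make the ~1TB figure concrete, here is a back-of-the-envelope sketch of the weight footprint alone. It assumes memory is simply parameters times bytes per parameter and uses the 180GB-per-GPU figure quoted above; KV cache, activations, and runtime overhead come on top and are not modeled. The function names are made up for illustration.

```python
import math

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8

def gpus_for_weights(params_billions: float, bits_per_param: int,
                     gpu_memory_gb: float) -> int:
    """Minimum GPU count that holds the weights, ignoring KV cache and overhead."""
    return math.ceil(weight_memory_gb(params_billions, bits_per_param) / gpu_memory_gb)

if __name__ == "__main__":
    # 750B parameters at common precisions, on 180GB GPUs (figure from the comment).
    for bits in (16, 8, 4):
        gb = weight_memory_gb(750, bits)
        n = gpus_for_weights(750, bits, 180)
        print(f"{bits}-bit: {gb:.0f} GB of weights, at least {n} x 180GB GPUs")
```

Weights alone are a lower bound; an 8-GPU box leaves the remainder for KV cache and concurrency.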

But yeah, I can see if the prices come down 10x in a few years, or crater after the bubble, $30-40k might get you a decent machine.
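A rough break-even sketch for the numbers in this comment, counting hardware cost only. Power, cooling, maintenance, and staff are deliberately left out, so the real break-even point is later than this suggests. The helper names are hypothetical.

```python
def breakeven_subscription_months(rig_cost_usd: float,
                                  subscription_usd_per_month: float = 200.0) -> float:
    """Subscription-months (seats x months) whose cost equals the rig price."""
    return rig_cost_usd / subscription_usd_per_month

def breakeven_tokens_millions(rig_cost_usd: float,
                              usd_per_million_tokens: float = 25.0) -> float:
    """Millions of API tokens whose cost equals the rig price."""
    return rig_cost_usd / usd_per_million_tokens

if __name__ == "__main__":
    # Today's range from the comment, plus the hoped-for ~10x cheaper machine.
    for rig in (500_000, 1_000_000, 40_000):
        months = breakeven_subscription_months(rig)
        mtok = breakeven_tokens_millions(rig)
        print(f"${rig:,}: {months:,.0f} subscription-months "
              f"or {mtok / 1000:,.1f}B tokens at $25/M")
```

At $500K that is 2,500 subscription-months, for example about 100 seats for roughly two years, or 20B tokens at $25/M.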

▲

zozbot234 2 hours ago | parent [-]

> Unfortunately that is nowhere close to being able to run a 750B param model. For something like that, we're getting closer to 1TB VRAM

You don't have to run a model from VRAM, or even from a sizeable amount of RAM. These choices only ever make sense when serving the model at scale, to hundreds of simultaneous users or more.

▲

bitmasher9 an hour ago | parent [-]

For workstation inference, a unified memory architecture would be a good cost/performance balance while keeping COGS reasonable. Macs with 512GB of unified memory are available, with the RAM upgrade costing a few grand.

▲

wongarsu 6 hours ago | parent | prev [-]

For an 8-bit quant (what people call "near lossless") you are looking at something like 4xMI350X, which comes out to about $150k after adding the rest of the server. More if you go with Nvidia instead of AMD. More if you want more than maybe 8x concurrency
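A quick sketch of why concurrency drives the GPU count. It assumes 288GB of HBM per MI350X (check the spec sheet) and a ~750B model at 8 bits, so about 750GB of weights; whatever is left over gets split across concurrent sessions as KV cache. Runtime overhead is ignored and the names are made up.

```python
def kv_headroom_gb(n_gpus: int, gpu_memory_gb: float, weights_gb: float) -> float:
    """Memory left for KV cache after loading the weights (overhead ignored)."""
    return n_gpus * gpu_memory_gb - weights_gb

def kv_budget_per_session_gb(headroom_gb: float, concurrency: int) -> float:
    """Even split of the leftover memory across concurrent sessions."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    return headroom_gb / concurrency

if __name__ == "__main__":
    headroom = kv_headroom_gb(n_gpus=4, gpu_memory_gb=288, weights_gb=750)
    for sessions in (1, 8, 32):
        per_session = kv_budget_per_session_gb(headroom, sessions)
        print(f"{sessions} sessions: {per_session:.1f} GB of KV cache each")
```

Under these assumptions about 402GB remains after the weights, so 8 sessions get roughly 50GB each; more concurrency or longer contexts means more GPUs.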

But prices are changing rapidly, and not for the better

▲

MikhailTal 7 hours ago | parent | prev | next [-]

This is not a new situation. The same thing happened when good vision models like AlexNet were coming through, especially for OCR. Companies had a choice between cloud and self-hosting with GPUs. But it turns out the problem is usage patterns.

Your usage will peak during work hours in certain timezones (even if you are a huge multinational company, most of your engineers/users tend to be in only a few locations), so you have a bunch of GPUs doing nothing the rest of the day. Especially with latency-sensitive stuff, this is a decades-old tradeoff problem; it's not unique to LLMs.
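The utilization point can be put in numbers with a small sketch. It assumes the GPUs are only busy during fixed working hours and that the cost of owning them accrues around the clock; the 8-hour, 5-day pattern and the function names are illustrative assumptions, not measurements.

```python
def utilization(busy_hours_per_day: float, busy_days_per_week: float) -> float:
    """Fraction of the week the hardware is actually serving requests."""
    return busy_hours_per_day * busy_days_per_week / (24 * 7)

def cost_per_busy_hour(cost_per_owned_hour: float, util: float) -> float:
    """Effective cost of each hour of real work, given idle time is still paid for."""
    if not 0 < util <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_per_owned_hour / util

if __name__ == "__main__":
    u = utilization(busy_hours_per_day=8, busy_days_per_week=5)
    print(f"utilization: {u:.1%}")
    print(f"cost multiplier per busy hour: {cost_per_busy_hour(1.0, u):.1f}x")
```

With a single-timezone, office-hours workload the hardware is busy under a quarter of the week, so each useful GPU-hour costs roughly four times the amortized hourly figure.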

▲

Havoc 7 hours ago | parent | prev | next [-]

It’s a ~750B model, so still a hell of a lot of VRAM

Would need to be a pretty determined medium biz

▲

moffkalast 7 hours ago | parent | prev | next [-]

So far there seems to be one major use case for complete privacy, and that is legal work. You don't need top-of-the-line models to search vast amounts of text in discovery, and it needs to be completely confidential. There are quite a few lawyers over on r/LocalLLaMA showing off their multi-GPU builds. Coincidentally, they also have the vast funding required for it.

▲

petesergeant 7 hours ago | parent | prev | next [-]

Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.

▲

CubsFan1060 7 hours ago | parent | next [-]

I think that's true until it isn't, which may end up being the problem. Fable/Mythos doesn't fall under the ZDR agreements with Anthropic. And I'm curious if others will follow suit.

▲

tancop 7 hours ago | parent | prev [-]

If you can afford the investment, you get stable low costs for years with better security (at least if your cyber team is good). It's even better in regulated industries, where some vendors might add a premium for HIPAA/SOC/PCI DSS compliance, to the point it's a lot cheaper to self-host. For a smaller business it's not worth it, and you should just use a hosted open model.

▲

petesergeant 5 hours ago | parent [-]

> to the point it's a lot cheaper to self host

I'm pretty skeptical, especially given typical utilization patterns. Do you have numbers, or is this just vibes?

▲

re-thc 7 hours ago | parent | prev [-]

> how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

Years.

Even Microsoft said they don't have enough for GitHub and need to call Amazon.

Getting even a few at decent prices is hard. Unless the shortage eases...