| ▲ | stingraycharles 3 hours ago |
| The economics of local AI just don't make sense. A model like Opus is supposedly something like 5T parameters, which likely translates to something like 3TB of GPU memory. Local deployments never reach the utilization that cloud providers do (80%+), so cloud is always going to be much better value than local for this reason. |
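(A rough sketch of the arithmetic behind both figures; the 5T parameter count, the storage precisions, and the utilization numbers below are all assumptions for illustration, not published numbers:)

    # Back-of-envelope: weight memory vs. precision, and the utilization gap.
    # All inputs here are assumed/illustrative, not published figures.
    params = 5e12                       # assumed parameter count for an Opus-class model
    for bits in (16, 8, 4):             # common inference precisions
        tb = params * bits / 8 / 1e12   # terabytes of weights alone
        print(f"{bits}-bit weights: ~{tb:.1f} TB")
    # -> ~10 TB at 16-bit, ~5 TB at 8-bit, ~2.5 TB at 4-bit, before any KV cache

    cloud_util, local_util = 0.80, 0.05  # assumed duty cycles for shared vs. personal hardware
    print(f"Same hardware costs ~{cloud_util / local_util:.0f}x more per useful hour at 5% utilization")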
|
| ▲ | lumost 3 hours ago | parent | next [-] |
| Capex, opex, quality, and volume are tricky things to balance. On balance, PCs and mobile devices are cheaper to operate than equivalent cloud and on-prem deployments. It's not unreasonable to suppose that in two years' time an Opus-5-quality model will be etched into silicon for high-performance local inference. Then you just upgrade your model every 2-3 years by upgrading your hardware. |
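(A minimal capex-vs-opex break-even sketch of that argument; every number here is a hypothetical placeholder, not a real price or quote:)

    # Hypothetical break-even: buy a local inference box vs. pay per token in the cloud.
    # All figures are illustrative placeholders; power and maintenance opex are ignored.
    hardware_cost = 5_000            # assumed one-time capex for a local box ($)
    lifetime_years = 3               # assumed life before the next hardware/model upgrade
    cloud_price_per_mtok = 10.0      # assumed cloud price, $ per million tokens
    tokens_per_day = 2_000_000       # assumed daily volume

    cloud_total = cloud_price_per_mtok * tokens_per_day / 1e6 * 365 * lifetime_years
    print(f"Cloud over {lifetime_years} years: ~${cloud_total:,.0f} vs. local capex ${hardware_cost:,}")
    # Which side wins is almost entirely a function of the volume assumed above.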
| |
| ▲ | jazzyjackson 2 hours ago | parent [-] | | I haven't been following anyone baking models into ASICs. Isn't it still necessary to pack just as many transistors onto a chip? Whether it's an NPU or a GPU, ASIC or not, you still need to hold hundreds of gigabytes in memory, so how is it cheaper to bake the model into custom silicon than to run it on commodity VRAM? (Asking because I don't know!) | | |
| ▲ | lumost an hour ago | parent [-] | | Not my area either! But my understanding is that there are more efficient ways to represent static weights when you can skip the VRAM lookup. https://taalas.com/ is an example startup in this area, claiming 16k tok/s on an ASIC for Llama 8B. Qwen has a 27B model at Opus 4.5 quality. | | |
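(A rough sanity check on why skipping the weight fetch is the whole game here; the batch size of 1 and 8-bit weights are assumptions:)

    # Dense decoding at batch size 1 touches every weight once per generated token,
    # so the weight bandwidth needed is roughly params * bytes_per_param * tokens/sec.
    params = 8e9             # Llama-8B-class model
    bytes_per_param = 1      # assumed 8-bit weights
    tok_per_s = 16_000       # the throughput claimed for the ASIC
    tb_per_s = params * bytes_per_param * tok_per_s / 1e12
    print(f"~{tb_per_s:.0f} TB/s of weight traffic if weights sat in external memory")
    # -> ~128 TB/s, far beyond a single HBM stack, which is why hard-wiring the weights
    #    into the chip (or batching very heavily) is what makes such numbers plausible.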
|
|
|
| ▲ | majormajor 2 hours ago | parent | prev [-] |
| Running applications locally is generally less efficient than using thin clients to the cloud, not just for LLMs. The trick is that you can get to the point where it's effective enough, and affordable enough, that the control and availability factors become dominant. |
| |
| ▲ | stingraycharles 2 hours ago | parent [-] | | My point is that you will always get much more value per dollar by using cloud-based solutions. | | |
| ▲ | sroerick 24 minutes ago | parent | next [-] | | I don't know that this is true. The cloud companies are making money, and inference is kind of just "hosting an inference server and trying to keep it humming 24/7." But in many cases self-hosted or dedicated boxes are cheaper than cloud. | |
| ▲ | majormajor 2 hours ago | parent | prev [-] | | I just don't see how that's different from getting more value by giving all your employees the most stripped-down Chromebook-type devices and running everything else in the cloud, rather than giving them "proper" laptops with local apps. It's a very narrow measure of "value/$" that excludes a lot of other things that could be of value to a business, like control, predictability, and availability. Thin clients have been going away for a long time; the trend has been to push higher levels of compute into ever-smaller, ever-more-portable devices. |
|
|