Glyptodon 2 days ago
I think there's another route this could go. At $7k a year or more per eng in token spend, it's very reasonable to buy engineers machines with obscene GPUs and RAM and run models locally. And if the math doesn't work out now, someone will figure it out and save companies $10k+/eng over 3 years.
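Rough back-of-envelope version of that math (every number below is an illustrative assumption, not a quote):

    # Back-of-envelope: cloud token spend vs. a local inference workstation.
    # Every figure here is an illustrative assumption, not a real quote.

    CLOUD_TOKEN_SPEND_PER_YEAR = 7_000    # per eng, per the figure above
    WORKSTATION_COST = 10_000             # hypothetical GPU/RAM-heavy box
    LIFETIME_YEARS = 3
    POWER_AND_UPKEEP_PER_YEAR = 500       # hypothetical electricity/upkeep

    cloud = CLOUD_TOKEN_SPEND_PER_YEAR * LIFETIME_YEARS
    local = WORKSTATION_COST + POWER_AND_UPKEEP_PER_YEAR * LIFETIME_YEARS

    print(f"3-year cloud spend per eng: ${cloud:,}")          # $21,000
    print(f"3-year local spend per eng: ${local:,}")          # $11,500
    print(f"Difference per eng:         ${cloud - local:,}")  # $9,500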
no-name-here a day ago
If you only want/need the kind of model output that can be served on a machine costing single-digit thousands, aren’t cheaper cloud-served models available? (And as the sister comment points out, sharing hardware allows greater utilization and lower cost per user.)
slopinthebag a day ago
I imagine companies are forming right now whose entire business model is building "prosumer" inference machines and farms, running everything from Qwen 3.6 27b up to GLM 5.1 and everything in between. Packaged right, they'd let companies make a one-time investment on the assumption that open models will keep getting both more efficient and better over time.
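The software side of that is arguably already commodity: local servers like vLLM and llama.cpp expose an OpenAI-compatible endpoint, so client code doesn't care where the model runs. A minimal sketch, assuming a hypothetical local server URL and model name:

    # Minimal sketch: talking to a locally hosted open model through the
    # OpenAI-compatible API that vLLM / llama.cpp servers expose.
    # The base_url and model name are placeholders for whatever the
    # local box actually serves.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # local inference server
        api_key="not-needed-locally",         # most local servers ignore this
    )

    resp = client.chat.completions.create(
        model="qwen-coder",  # whichever open model the machine is running
        messages=[{"role": "user", "content": "Write a binary search in Python."}],
    )
    print(resp.choices[0].message.content)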
charcircuit 2 days ago
That could leave idle time where GPUs sit unused. It would be better to have a shared cluster used by many engineers. And to keep the cluster saturated, other companies' queries could be batched in too. And oh wait, we're back to doing AI inference in the cloud, because it's an efficient way to serve AI.
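For a concrete sense of why batching drives the economics, here's a toy sketch of window-based request batching. Real serving stacks (e.g. vLLM's continuous batching) are far more sophisticated, and run_model() below is just a stand-in for an actual batched forward pass:

    # Toy sketch: collect prompts from many users for a short window,
    # then run them as one batch so the GPU stays saturated.
    import queue
    import threading
    import time

    request_queue: "queue.Queue[str]" = queue.Queue()

    def run_model(batch: list[str]) -> list[str]:
        # Placeholder for a real batched forward pass on the GPU.
        return [f"response to: {p}" for p in batch]

    def batch_worker(max_batch: int = 32, window_s: float = 0.05) -> None:
        while True:
            batch = [request_queue.get()]  # block until at least one prompt
            deadline = time.monotonic() + window_s
            while len(batch) < max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(request_queue.get(timeout=remaining))
                except queue.Empty:
                    break
            run_model(batch)  # one GPU pass serves many users at once

    threading.Thread(target=batch_worker, daemon=True).start()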