I think it's niche now because the hardware to run it is expensive and the quantized models don't work as well. If those improve, it would be a no-brainer to pay once for the hardware instead of a fortune for API calls.

dofm 5 hours ago | parent | next [-]

I am not really convinced that four-bit quantisation is that bad; almost certainly six bits will be enough. And Google are claiming that the QAT tech in Gemma, which they are surely using or testing in Gemini, preserves nearly source-model quality while reducing footprint.

The hardware for 50 tokens per second with a four bit quantisation of Gemma 4 26B or the sparse Qwen 3.6 is not really that expensive: it’s a secondhand M1 Max.
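As a rough sanity check on the footprint, here is a back-of-the-envelope sketch. It assumes exactly 26 billion parameters and a flat bits-per-weight figure, and it ignores KV cache, activations and the per-block scale data that real quantised formats add, so treat the numbers as lower bounds rather than measurements. The function name is made up for illustration.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width, with no
    quantisation metadata, KV cache or runtime overhead counted.
    """
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Compare an unquantised 16-bit model with six-bit and four-bit versions.
    for bits in (16, 6, 4):
        print(f"{bits}-bit: {weight_footprint_gb(26, bits):.1f} GB")
```

On those assumptions the four-bit weights come to about 13 GB, which is why a 32 GB or 64 GB M1 Max has room for them plus context, whereas the 16-bit weights alone would be about 52 GB.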

Beyond that, I agree. I think moving planning tasks to local is a now thing, not that it really has much impact on token spend. I also think many small coding tasks are fully within the grasp of the above two models.

The main issue right now is that the software landscape is rather confusing, but I reckon uncomplicated Gemma 4 26B QAT support with MTP is a few weeks away.

jqpabc123 5 hours ago | parent | prev [-]

AI vendors are attempting to offer the whole apple. And they are spending huge sums of money in the process.

But most businesses don't really care about most of the apple --- they only need their special bite out of it.

For example, doctors mainly care about medicine. Nvidia is attempting to provide the hardware needed for local, specialized models.

dofm 5 hours ago | parent [-]

I think it is likely to appeal to video and photo editors who want to use AI tools (the press release has a quote from Blackmagic Design, as well as from Adobe, who I think have no stomach for their own cloud AI). But I'm not sure about "specialised": this hardware could run quite large MoE models.