| ▲ | vb-8448 3 hours ago |
| > within a few years we will be running local models as good as today’s frontier |
| Unless there is some major breakthrough in hw production or model architecture, it's quite the opposite: bigger, more expensive, and more energy-intensive hw is needed today compared to 1 or 2 years ago. |
|
| ▲ | evgen 3 hours ago | parent | next [-] |
| On a four-year-old MacBook Pro I can run qwen3.6-27b, which dominates ChatGPT-4o (the frontier model from 2 years ago) and is competitive against early ChatGPT-5 versions. We are also getting a lot smarter about using and deploying these local models. Your entire AI stack from two years ago would be absolutely crushed by today's local LLMs on a high-end local inference system, combined with a good modern coding agent. |
| |
| ▲ | vb-8448 23 minutes ago | parent [-] | | Today's open-weights frontier models cannot run locally unless quantization is used. DeepSeek v4 pro requires almost 1 TB of RAM even in INT4. I highly doubt there will be consumer-grade HW able to run it in 2 years either. And DeepSeek v4 pro is not even close to OAI or Anthropic frontier models. |
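For scale, a minimal sketch of the weight-memory arithmetic behind that claim. The ~2T parameter count below is a hypothetical figure back-calculated from the ~1 TB INT4 number above (not a published spec), and the estimate ignores KV cache and runtime overhead:

    # Rough RAM needed just to hold the weights of a quantized model.
    # num_params is an assumption, back-calculated from "~1 TB in INT4".
    def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
        return num_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

    params = 2e12  # assumed ~2T parameters
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):,.0f} GB")
    # 16-bit: 4,000 GB
    #  8-bit: 2,000 GB
    #  4-bit: 1,000 GB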
|
|
| ▲ | chermi 3 hours ago | parent | prev | next [-] |
| Per frontier token. You're not calculating the cost of a fixed-quality asset here. Old hw running non-frontier models will still be very valuable. In fact, we have two direct examples: older server GPUs actually appreciating, and the very obvious fact that not everyone always uses MAX FULL EFFORT BEST MODEL no matter what. |
| |
|
| ▲ | ls612 3 hours ago | parent | prev [-] |
| As good as today’s frontier. Gemma 4 today is roughly equivalent to the frontier of a year and a half ago, i.e. the GPT-4o tier. |
| |
| ▲ | antisthenes 3 hours ago | parent [-] | | What's the cheapest PC you can buy today that will comfortably run Gemma 4 alongside everything else you want it to run at the same time? And how many API tokens would that money buy instead? | | |
| ▲ | ls612 3 hours ago | parent [-] | | I run it on my four-year-old MBP and get 10 tok/s. With the RAM shortage, buying anything new today is a nightmare, but anyone with a reasonably modern Mac could probably run it at q6. It is mostly a toy, since 4o-tier models weren't really suitable for real work IMO, but at least it won't ever give me a refusal. | | |
| ▲ | jazzyjackson 2 hours ago | parent [-] | | At 10 tok/s, are you using it interactively, or do you submit a prompt and come back to it later? I always thought it would make sense to do conversations over email, asynchronously: the model can take all the time it needs and get back to me when it has an answer. | | |
| ▲ | ls612 an hour ago | parent [-] | | 10 tok/s is around the borderline of interactive use being acceptable. I did the math and it is mostly bottlenecked by memory bandwidth, so in the future I expect to run a similarly sized model on my 4090, once it gets retired from gaming service, and get ~25 tok/s, which will be very usable. |
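A minimal sketch of that memory-bandwidth math, with illustrative (assumed) bandwidth and model-size numbers rather than measurements: during decode, each new token streams roughly the full set of active weights from memory, so bandwidth divided by weight size gives a throughput ceiling, and real systems land somewhere below it.

    # Decode-throughput ceiling for a memory-bandwidth-bound LLM:
    # each generated token reads (roughly) all active weights once.
    def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
        return bandwidth_gb_s / weights_gb

    weights_gb = 20.0  # assumed: a ~27B model at ~6 bits/weight
    print(decode_ceiling_tok_s(200, weights_gb))   # ~10 tok/s ceiling at 200 GB/s (laptop-class, assumed)
    print(decode_ceiling_tok_s(1000, weights_gb))  # ~50 tok/s ceiling at ~1 TB/s (discrete-GPU-class, assumed)

The ~25 tok/s figure quoted above is consistent with a real system landing well below the discrete-GPU ceiling (KV-cache reads, scheduling overhead, imperfect bandwidth utilization).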
|
|
|
|