| ▲ | s1mplicissimus a day ago |
| Seems to completely ignore usage of local/free models as well as anything but Sonnet/ChatGPT. So my confidence in the good faith of the author is... heavily restricted. |
|
| ▲ | nicce a day ago | parent | next [-] |
Most people can’t afford the GPUs for local models if you want to get close to cloud capabilities. |
| |
| ▲ | rhdunn a day ago | parent | next [-] | | A 4090 has 24GB of VRAM, allowing you to run a 22B model entirely in memory at FP8 and 24B models at Q6_K (~19GB). A 5090 has 32GB of VRAM, allowing you to run a 32B model in memory at Q6_K. You can run larger models by splitting the layers between those run in VRAM and those kept in RAM. That is slower, but still viable. This means that you can run the Qwen3-Coder-30B-A3B model locally on a 4090 or 5090. That model is a Mixture of Experts model with 3B active parameters, so you really only need enough VRAM for the 3B active parameters and could run it on a 3090. The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM. Yes, it will be slower than running it in the cloud. But you can get a long way with a high-end gaming rig. | | |
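(A rough sanity check on the VRAM figures above, as a back-of-envelope sketch: the bits-per-weight values below are approximations for common quantization formats, and real usage adds KV cache and runtime overhead on top, so treat the output as ballpark numbers only.)

```python
# Back-of-envelope estimate of weight storage for a dense model at a given
# quantization. Bits-per-weight figures are approximate, and KV cache plus
# runtime overhead come on top, so this is a sanity check, not a guarantee.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "FP8": 8.0,
    "Q6_K": 6.56,    # approximate effective bits/weight for llama.cpp Q6_K
    "Q4_K_M": 4.85,  # approximate
}

def approx_weights_gb(params_billions: float, quant: str) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for params, quant in [(22, "FP8"), (24, "Q6_K"), (32, "Q6_K")]:
    gb = approx_weights_gb(params, quant)
    print(f"{params}B @ {quant}: ~{gb:.1f} GB of weights (+ context/overhead)")
```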
| ▲ | cmclaughlin 28 minutes ago | parent | next [-] | | I also expect local LLMs to catch up to the cloud providers. I spent last weekend experimenting with Ollama and LM Studio. I was impressed by how good Qwen3-Coder is. Not as good as Claude, but close - maybe even better in some ways. As I understand it, the latest Macs are good for local LLMs due to their unified memory. 32GB of RAM in one of the newer M-series seems to be the "sweet spot" for price versus performance. | |
| ▲ | iberator a day ago | parent | prev | next [-] | | That's out of touch for 90% of developers worldwide | | |
| ▲ | brazukadev a day ago | parent [-] | | Today. But what about in 5 years? Would you bet we will be paying hundreds of billions to OpenAI yearly or buying consumer GPUs? I know what I will be doing. | | |
| ▲ | Dilettante_ a day ago | parent | next [-] | | But the progress goes both ways: In five years, you would still want to use whatever is running on the cloud supercenters. Just like today you could run gpt-2 locally as a coding agent, but we want the 100x-as-powerful shiny thing. | | |
| ▲ | mcny a day ago | parent | next [-] | | That would be great if that were the case, but my understanding is that the progress is plateauing. I don't know how much of this is Anthropic / Google / OpenAI holding themselves back to save money and how much is the state-of-the-art improvement slowing down, though. I can imagine there could be a 64 GB GPU in five years, as absurd as it feels to type that today. | | |
| ▲ | simonw a day ago | parent | next [-] | | What gives you the impression the progress is plateauing? I'm finding the difference just between Sonnet 4 and Sonnet 4.5 to be meaningful in terms of the complexity of tasks I'm willing to use them for. | |
| ▲ | sebastiennight a day ago | parent | prev [-] | | > a 64 GB GPU in five years Is there a digit missing? I don't understand why this existing in 5 years is absurd | | |
| ▲ | mcny 5 hours ago | parent [-] | | I meant for me it feels absurd today but it will likely happen in five years. |
|
| |
| ▲ | brazukadev a day ago | parent | prev [-] | | Not really; for many cases I'm happy using Qwen3-8B on my computer, and I would be very happy if I could run Qwen3-Coder-30B-A3B. |
| |
| ▲ | infecto a day ago | parent | prev | next [-] | | Paying for compute in the cloud. That’s what I am betting on. Multiple providers, different data center players. There may be healthy margins for them but I would bet it’s always going to be relatively cheaper for me to pay for the compute rather than manage it myself. | |
| ▲ | alfiedotwtf 20 hours ago | parent | prev [-] | | Woah, woah, woah. I thought in 5 years time we would all be out of a job lol |
|
| |
| ▲ | reaslonik a day ago | parent | prev | next [-] | | You need to leave much more room for context if you want to do useful work besides entertainment. Luckily there are _several_ PCIe slots on a motherboard. New Nvidia cards at retail (or above) are not the only choice for building a cluster; I threw a pile of Intel Battlemage cards on it and got away with ~30% of the Nvidia cost for the same capacity (setup was _not_ easy in early 2025 though). You can gain a lot of performance by using optimal quantization techniques for your setup (ix, AWQ etc); different llama.cpp builds perform differently from each other, and very differently compared to something like vLLM. | |
| ▲ | jen729w a day ago | parent | prev | next [-] | | Honestly though, how many people reading this do you think have that setup, vs. the 85% of us on an MBx? > The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM. Reminds me of running Doom when I had to hack config.sys to scrounge up 640KB of memory. Less than 0.1% of the people reading this are doing that. Me, I gave $20 to some cloud service and I can do whatever the hell I want from this M1 MBA in a hotel room in Japan. | | |
| ▲ | radicalbyte a day ago | parent | next [-] | | > Reminds me of running Doom when I had to hack config.sys to forage 640KB of memory. The good old days of having to do crazy nutty things to get Elite II: Frontier, Magic Carpet, Worms, Xcom: UFO Enemy Unknown, Syndicate et cetera to actually run on my PC :-) | | |
| ▲ | alfiedotwtf 20 hours ago | parent [-] | | The crazy nutty thing these days is quitting Chrome because it’s consuming 90% of RAM |
| |
| ▲ | reaslonik a day ago | parent | prev [-] | | >I can do whatever the hell I want from this M1 MBA in a hotel room in Japan. As long as it's within the terms and conditions of whatever agreement you made for that $20. I can run queries on my own inference setup from remote locations too. | | |
| |
| ▲ | Foobar8568 a day ago | parent | prev | next [-] | | Yes, but they are really less capable than Claude Code or Codex. I really cried over the 20-25GB models (30B Qwen, Devstral, etc.). They don't hold a candle to them; I didn't think the gap was this large, or maybe Claude Code and GPT perform much better than I imagined. | |
| ▲ | ashirviskas a day ago | parent | prev | next [-] | | How much context do you get with 2GB of leftover VRAM on an Nvidia GPU? |
| ▲ | electroglyph a day ago | parent | prev [-] | | you need a couple RTX 6000 pros to come close to matching cloud capability |
| |
| ▲ | s1mplicissimus a day ago | parent | prev | next [-] | | Most people I know can't afford to leak insider business information to 3rd-party SaaS providers, so it's unfortunately not really an option. | | |
| ▲ | ruszki a day ago | parent | next [-] | | But… they do all the time. Almost everybody uses some mix of Office, Slack, Notion, random email providers, random “security” solutions, etc. The exception is the opposite. The only thing that prevents info from leaking is the ToS, and there are options for that even with LLMs. Nothing has changed regarding that. | | |
| ▲ | Antibabelic a day ago | parent | next [-] | | In my personal experience, it's very common for big companies to host email, messengers, conferencing software on their own servers. | | |
| ▲ | infecto a day ago | parent | next [-] | | In my experience it’s very common for big companies to not host. Think Fortune 500 type companies. Most are legally happy with their MSA and reasonably confident in security standards. | |
| ▲ | s1mplicissimus a day ago | parent | prev | next [-] | | > In my personal experience, it's very common for big companies to host email, messengers, conferencing software on their own servers. Mind sharing a clarification on your understanding of "common" and "big"? | |
| ▲ | ruszki a day ago | parent | prev [-] | | Yes, and then they use Outlook, for example. Have you checked the ToS of the new Outlook version for consumers? They flat out state that they can use all of your emails for whatever they want. Also, companies host, for example, an Exchange server on-prem; and guess what it connects to? Why do you think you can usually access your account at outlook.com? | | |
| ▲ | consonaut a day ago | parent [-] | | Your on-premise Exchange server has zero connections to outlook.com. OWA (Outlook Web Access) looks similar to outlook.com but otherwise has nothing to do with it. | | |
| ▲ | ruszki 18 hours ago | parent [-] | | Then let’s call it outlook.office.com, and that’s an OWA, and you’re redirected there if you log in at outlook.com. And nowadays it’s the exception when on-prem Exchange servers really work as more than mere proxies. I’m quite sure that there are still real, fully on-prem solutions, but these opinions that most companies really care about this are laughable. They simply don’t. |
|
|
| |
| ▲ | omgmajk a day ago | parent | prev [-] | | All of those things are hosted on-prem in the bigger orgs I have worked in. | | |
| ▲ | simonw a day ago | parent [-] | | I don't think Slack or Notion have on-prem/self-hosted options. |
|
| |
| ▲ | infecto a day ago | parent | prev [-] | | This is a poor take imo. It depends on the industry, but the world's businesses run on the shoulders of companies like Microsoft and heavily use OneDrive/SharePoint. Most entities, even those with sensitive information, are legally comfortable with that arrangement. Using an LLM does not change much so long as the MSA is similar. | | |
| ▲ | s1mplicissimus a day ago | parent [-] | | > It depends on the industry, but the world's businesses run on the shoulders of companies like Microsoft and heavily use OneDrive/SharePoint I am sure MS employees need to tell themselves that to sleep well. The statement itself doesn't seem to hold much epistemological value beyond that, though. | | |
| ▲ | infecto a day ago | parent [-] | | It is in direct conflict with your idea. I am sure you know some people within your circle who say they cannot leak data, but the fact remains: over 85% of Fortune 500 companies use some combo of OneDrive and SharePoint. The companies have already gotten familiar with the risks and are legally comfortable with the MSAs. So I am not sure what legs you are standing on. Absolutely, there are specific companies or industries where they think the risk is too great, but for many, outsourcing the process carries either the same or less risk than doing it all in-house. |
|
|
| |
| ▲ | EagnaIonat a day ago | parent | prev | next [-] | | The more recent LLMs work fine on an M1 Mac. Can't speak for Windows/Linux. There was even a recent release of Granite4 that runs on a Raspberry Pi. https://github.com/Jewelzufo/granitepi-4-nano For my local work I use Ollama (M4 Max, 128GB): - gpt-oss, 20B or 120B depending on the complexity of the use case. - granite4 for speed and lower complexity (around the same as gpt-oss 20B). | |
| ▲ | whitehexagon a day ago | parent | prev | next [-] | | Agreed, GPU is the expensive route, especially when I was looking at external GPU solutions. Using Qwen3:32b on a 32GB M1 Pro may not be "close to cloud capabilities" but it is more than powerful enough for me, and most importantly, local and private. As a bonus, running Asahi Linux feels like I own my Personal Computer once again. | | |
| ▲ | mark_l_watson a day ago | parent [-] | | I agree with you (I have a 32GB M2 Pro), and I like to mix local models running with Ollama and LM Studio with using gemini-cli (I also used to occasionally use codex, but I just cancelled my $20/month OpenAI subscription - I like their products but I don’t like their business model, so I lose out now on that option). Running smaller models on Apple Silicon is kinder on the environment/energy use and has privacy benefits for corporate use. Using a hybrid approach makes sense for many use cases. Everyone gets to make their own decisions; for me, I like to factor in externalities like social benefit, the environment, and wanting the economy to do as well as it can in our new post-monopolar world. |
| |
| ▲ | Tepix a day ago | parent | prev [-] | | Isn't the point that you don't need SOTA capabilities all the time? |
|
|
| ▲ | pistoriusp a day ago | parent | prev | next [-] |
| Do you use a local/ free model? |
| |
| ▲ | s1mplicissimus a day ago | parent | next [-] | | Yes, for the little it's good for, I'm currently using LM Studio with varying models |
| ▲ | busymom0 a day ago | parent | prev [-] | | I am currently using a local model, qwen3:8b, running on a 2020 Mac mini (with a 2018 Intel chip) for classifying news headlines, and it's working decently well for my task. Each headline takes about 2-3 seconds but is pretty accurate. It uses about 5.3 gigs of RAM. | | |
| ▲ | darkwater a day ago | parent [-] | | Can you expand a bit on your software setup? I thought running local models was restricted to having expensive GPUs or the latest Apple Silicon with unified memory. I have an Intel 11th-gen home server which I would like to use to run some local model for tinkering, if possible. | | |
| ▲ | simonw a day ago | parent | next [-] | | Those little 4B and 8B models will run on almost anything. They're really fun to try out but severely limited in comparison to the larger ones - classifying headlines to categories should work well but I wouldn't trust them to refactor code! If you have 8GB of RAM you can even try running them directly in Chrome via WebAssembly. Here's a demo running a model that's less than 1GB to load, entirely in your browser (and it worked for me in mobile safari just now): https://huggingface.co/spaces/cfahlgren1/Qwen-2.5-WebLLM | |
| ▲ | marmarama a day ago | parent | prev | next [-] | | It's really just a performance tradeoff, and where your acceptable performance level is. Ollama, for example, will let you run any available model on just about any hardware. But using the CPU alone is _much_ slower than running it on any reasonable GPU, and obviously CPU performance varies massively too. You can even run models that are bigger than available RAM, but performance will be terrible. The ideal case is to have a fast GPU and run a model that fits entirely within the GPU's memory. In these cases you might measure the model's processing speed in tens of tokens per second. As the idealness decreases, the processing speed decreases. On a CPU only, with a model that fits in RAM, you'd be maxing out in the low single-digit tokens per second, and on lower-performance hardware, you start talking about seconds per token instead. If the model does not fit in RAM, then the measurement is minutes per token. For most people, the minimum acceptable performance level is in the double-digit tokens-per-second range, which is why people optimize for that with high-end GPUs with as much memory as possible, and choose models that fit inside the GPU's RAM. But in theory you can run large models on a potato, if you're prepared to wait until next week for an answer. | |
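(A crude way to see why those tokens-per-second figures land where they do: single-stream decoding is usually memory-bandwidth-bound, so a rough upper bound is memory bandwidth divided by the bytes read per token, which for a dense model is roughly the size of the quantized weights. The sketch below uses that rule of thumb with assumed bandwidth numbers; it ignores prompt processing, batching, MoE routing, and offloading, so it is an estimate, not a benchmark.)

```python
# Rule-of-thumb decode speed: tokens/sec is roughly bounded by
# memory_bandwidth / bytes_read_per_token, and for a dense model the bytes
# read per token are approximately the size of the quantized weights.
# The bandwidth figures below are assumed/typical values, not measurements.

def rough_tokens_per_sec(weights_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / weights_gb

scenarios = [
    ("RTX 4090 (~1000 GB/s), 19 GB model in VRAM", 19, 1000),
    ("Dual-channel DDR5 (~80 GB/s), same model on CPU", 19, 80),
    ("Older DDR4 laptop (~30 GB/s), same model", 19, 30),
]

for name, weights_gb, bandwidth in scenarios:
    tps = rough_tokens_per_sec(weights_gb, bandwidth)
    print(f"{name}: ~{tps:.0f} tokens/sec upper bound")
```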
| ▲ | mark_l_watson a day ago | parent [-] | | +1 > It's really just a performance tradeoff, and where your acceptable performance level is. I am old enough to remember developers respecting the economics of running the software they create. Ollama running locally paired occasionally with using Ollama Cloud when required is a nice option if you use it enough. I have twice signed up and paid $20/month for Ollama Cloud, love the service, but use it so rarely (because local models so often are sufficient) that I cancelled both times. If Ollama ever implements a pay as you go API for Ollama Cloud, then I will be a long term customer. I like the business model of OpenRouter but I enjoy using Ollama Cloud more. I am probably in the minority, but I wish subscription plans would go away and Claude Code, gemini-cli, codex, etc. would all be only available pay as you go, with ‘anti dumping’ laws applied to running unsustainable businesses. I don’t mean to pick on OpenAI, but I think the way they fund their operations actually helps threaten the long term viability of our economy. Our government making the big all-in bet on AI dominance seems crazy to me. |
| |
| ▲ | busymom0 21 hours ago | parent | prev [-] | | It really is a very simple setup. I basically had an old Intel-based Mac mini from 2020 (the Intel chip inside it is from 2018). It's a 3 GHz 6-core Core i5. I had upgraded the RAM on it to 32 GB when I bought it. However, Ollama only uses about 5.5 gigs of it, so it can be run on a 16GB Mac too. The Qwen model I am using is fairly small but does the job I need it to for classifying headlines pretty decently. All I ask it to do is decide whether a specific headline is political or not. It only responds with True or False. I access this model from an app (running locally) using the `http://localhost:11434/api/generate` REST API with `think` set to false. Note that this Qwen model is a `thinking` model, so disabling thinking is important; otherwise it takes very long to respond. Note that I tested this on my newer M4 Mac mini too, and there the performance is a LOT faster. Also, on my new M4 Mac, I originally tried using Apple's built-in Foundation Models for this task, and while it was decent, many times it was hitting Apple's guardrails and refusing to respond because it claimed the headline was too sensitive. So I switched to the Qwen model, which didn't have this problem. Note that while this does the job I need it to, as another comment said, it won't be much help for things like coding. | |
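(For anyone who wants to try something similar, here is a minimal sketch of the kind of call described above against Ollama's `/api/generate` endpoint with `think` set to false. The model tag, prompt wording, and response parsing are illustrative assumptions, not the commenter's actual code.)

```python
# Hypothetical sketch of the classification call described above, using
# Ollama's /api/generate endpoint with thinking disabled. The prompt text
# and parsing are illustrative assumptions, not the commenter's code.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def is_political(headline: str) -> bool:
    resp = requests.post(OLLAMA_URL, json={
        "model": "qwen3:8b",
        "prompt": (
            "Is the following news headline political? "
            "Answer with exactly one word, True or False.\n\n"
            f"Headline: {headline}"
        ),
        "think": False,   # qwen3 is a thinking model; disabling keeps latency low
        "stream": False,  # return a single JSON object instead of a stream
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["response"].strip().lower().startswith("true")

if __name__ == "__main__":
    print(is_political("Parliament passes new budget after late-night vote"))
```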
|
|
|
|
| ▲ | NumerousProcess a day ago | parent | prev | next [-] |
| Augment doesn't support local models or anything else other than Claude/GPT |
|
| ▲ | moffkalast a day ago | parent | prev [-] |
I think it's also true for many local models. People still use NeMo, QwQ, and Llama3 for use cases that fit them, despite there being replacements that do better on "benchmarks". Not to mention relics like BERT that are still tuned for classification even today. ML models always have weird behaviours, and a successor is unlikely to be better in literally every way; once you have something that works well enough, it's hard to upgrade without facing different edge cases. Inference for new releases is routinely bugged for at least a month or two as well, depending on how active the devs of a specific inference engine are and how much model creators collaborate. Personally, I hate how data from GPT's few-week (and arguably somewhat ongoing) sycophancy rampage has leaked into datasets that are used for training local models, making a lot of new LLM releases insufferable to use. |