It's really just a performance tradeoff, and a question of where your acceptable performance level sits.
Ollama, for example, will let you run any available model on just about any hardware. But using the CPU alone is _much_ slower than running it on any reasonable GPU, and obviously CPU performance varies massively too.
You can even run models that are bigger than your available RAM, but performance will be terrible.
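If you want to see where your own setup actually lands, Ollama's local API reports token counts and timings with each response, so you can compute tokens per second yourself. Here's a rough Python sketch, assuming Ollama is running on its default port; the model name and prompt are just placeholders for whatever you have pulled locally:

```python
# Minimal sketch: measure generation speed against a local Ollama server.
# Assumes Ollama is listening on its default port (11434) and that the
# named model has already been pulled -- swap in whatever you actually use.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # placeholder model name
        "prompt": "Explain GPU offloading in one paragraph.",
        "stream": False,
    },
    timeout=600,  # CPU-only runs can take a while
).json()

# eval_count is the number of generated tokens; eval_duration is nanoseconds
tokens = resp["eval_count"]
seconds = resp["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tokens/sec")
```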
The ideal case is to have a fast GPU and run a model that fits entirely within the GPU's memory. In these cases you might measure the model's processing speed in tens of tokens per second.
The further you get from that ideal, the slower things run. On a CPU alone with a model that fits in RAM, you'd be maxing out in the low single digits of tokens per second, and on lower-performance hardware you start talking about seconds per token instead. If the model doesn't fit in RAM, the measurement becomes minutes per token.
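To put those tiers in wall-clock terms, here's the back-of-the-envelope arithmetic for a typical ~500-token answer. The rates are illustrative round numbers, not benchmarks of any particular setup:

```python
# Rough wall-clock time for a ~500-token answer at each performance tier.
answer_tokens = 500
tiers = [
    ("GPU, model fits in VRAM", 40),        # tens of tokens/sec
    ("CPU only, model fits in RAM", 3),     # low single digits
    ("Model swapping to disk", 1 / 60),     # roughly a token a minute
]
for label, tokens_per_sec in tiers:
    minutes = answer_tokens / tokens_per_sec / 60
    print(f"{label:30s} ~{minutes:7.1f} minutes")
```

That's the difference between an answer in seconds, an answer in a few minutes, and an answer sometime tomorrow.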
For most people, the minimum acceptable performance is in the double-digit tokens per second range, which is why they optimize for it: a high-end GPU with as much memory as possible, and a model chosen to fit inside the GPU's RAM. But in theory you can run large models on a potato, if you're prepared to wait until next week for an answer.
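If you want a rough sense of what "fits inside the GPU's RAM" means before downloading anything, the usual estimate is parameter count times bytes per weight at your quantization, plus some allowance for the KV cache and runtime. A sketch of that arithmetic, with the 20% overhead figure being a guess rather than a rule:

```python
# Rough "does it fit?" estimate: weight memory is parameter count times
# bytes per weight for the chosen quantization, plus overhead for the
# KV cache and runtime. Treat the output as a ballpark, not a guarantee.
def estimated_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.2) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb * (1 + overhead)

for params, bits in [(7, 4), (13, 4), (70, 4), (70, 16)]:
    print(f"{params}B model at {bits}-bit: ~{estimated_vram_gb(params, bits):.0f} GB")
```

So a 7B model at 4-bit quantization fits comfortably on a modest GPU, while a 70B model at full precision is out of reach for anything short of datacenter hardware.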