aliljet 8 hours ago

I'm broadly curious how people are using these local models. Literally, how are they attaching harnesses to these and finding more value than just renting tokens from Anthropic or OpenAI?

seemaze 8 hours ago | parent | next [-]

Qwen3.5-9B has been extremely useful for local fuzzy table extraction OCR for data that cannot be sent to the cloud.

The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.

Now, with the multi-modal capabilities of these models, we can leverage their language capabilities alongside vision to extract structured data from a table that has 'roughly this shape' and 'this location'.
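A minimal sketch of that pattern, assuming a locally served vision model behind an OpenAI-compatible endpoint (LM Studio, vLLM, and llama.cpp all expose one); the model name, prompt wording, and column names here are placeholders, not the actual pipeline:

```python
import base64

def build_table_request(image_bytes: bytes, model: str = "qwen3.5-9b") -> dict:
    """Build an OpenAI-compatible vision request asking for structured table data."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("The table is roughly centered on the page, with "
                          "columns item/qty/price. Return it as a JSON array "
                          "of objects, nothing else.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "temperature": 0,
    }
```

POST that payload to the local server's /v1/chat/completions and parse the reply; nothing leaves the machine.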

marssaxman 8 hours ago | parent | prev | next [-]

I used vLLM and qwen3-coder-next to batch-process a couple million documents recently. No token quota, no rate limits, just 100% GPU utilization until the job was done.
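For anyone curious what that looks like, a rough sketch of the vLLM offline-batch pattern: vLLM does continuous batching internally, so you can hand it the whole corpus as one list of prompts and the GPU stays saturated until it finishes. The model path, prompt, and sampling settings are illustrative assumptions, not the commenter's setup:

```python
def make_prompts(docs, task="Extract the title and date as JSON."):
    """Turn raw documents into prompts; vLLM batches the whole list itself."""
    return [f"{task}\n\n{doc}" for doc in docs]

def run_batch(docs, model_path="Qwen/Qwen3-Coder-30B-A3B-Instruct"):
    """Run the corpus through a local model at full GPU utilization."""
    # Imported lazily so the prompt-building half works without a GPU.
    from vllm import LLM, SamplingParams
    llm = LLM(model=model_path)
    params = SamplingParams(temperature=0, max_tokens=256)
    outputs = llm.generate(make_prompts(docs), params)
    return [o.outputs[0].text for o in outputs]
```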

oompydoompy74 8 hours ago | parent | prev | next [-]

Idk about everyone else, but I don’t want to rent tokens forever. I want a self hosted model that is completely private and can’t be monitored or adulterated without me knowing. I use both currently, but I am excited at the prospect of maybe not having to in the near to mid future.

I’ve increasingly started self-hosting everything at home lately because I got tired of SaaS rug pulls, and I don’t see why LLMs should eventually be any different.

znnajdla 7 hours ago | parent | prev | next [-]

Some tasks don’t require SOTA models. For translating small texts I use Gemma 4 on my iPhone because it’s faster and better than Apple Translate or Google Translate, and it works offline. Also, if you can break certain tasks (like JSON healing) down into small, focused coding tasks, then local models are useful.
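JSON healing is a good example of a task small enough to split up: try cheap deterministic repairs first, and only hand the leftovers to a model. A sketch, where the repair rules are illustrative and the model fallback is stubbed out:

```python
import json
import re

def heal_json(text: str):
    """Parse possibly-broken JSON from a model, trying cheap repairs first."""
    # Strip markdown code fences, if the model wrapped its output in one.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Common failure: trailing commas before } or ].
    repaired = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        # Last resort: send the snippet to a small local model with a narrow
        # instruction like "return only the corrected JSON, nothing else".
        raise
```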

kaliqt 6 hours ago | parent [-]

Is it really better? In which languages?

deaux 5 hours ago | parent | next [-]

Yes, it is, and it has been for years now. Gemini 1.5 Pro is when LLM translations started significantly outperforming non-LLM machine translation, and that came out over 2 years ago.

Ever since then Google models have been the strongest at translation across the board, so it's no surprise Gemma 4 does well. Gemini 3 Flash is better at translation than any Claude or GPT model. OpenAI models have always been weakest at it, continuing to this day. It's quite interesting how these characteristics have stayed stable over time and many model versions.

I'm primarily talking about non-trivial language pairs, something like English<>Spanish is so "easy" now it's hard to distinguish the strong models.

homebrewer 3 hours ago | parent | prev [-]

I've been using gemma4 for translating Mongolian to English. It runs circles around Google Translate for that language pair, it's not even close.

jwitthuhn 5 hours ago | parent | prev | next [-]

I've been largely using Qwen3.5-122b at a 6-bit quant locally for some C++/Go/Python dev lately. It's quite capable: as long as I can give it pretty specific asks within the codebase, it will produce code that needs minimal massaging to fit into the project.

I do have a $20 claude sub I can fall back to for anything qwen struggles with, but with 3.5 I have been very pleased with the results.

3836293648 2 hours ago | parent [-]

How much VRAM do you need for that?

seemaze a minute ago | parent [-]

I squeeze Qwen3.5-122B-A10B at Q6 into 128GB. It's a great model.
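As a rough sanity check on those numbers (assuming Q6_K's roughly 6.5 bits per weight, which is an approximation):

```python
# Rule of thumb: a k-bit quant needs about k/8 bytes per parameter.
params = 122e9            # Qwen3.5-122B total parameter count
bits_per_weight = 6.5     # Q6_K averages slightly over 6 bits/weight
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~99 GB
```

That leaves on the order of 29 GB of the 128 GB for KV cache and runtime buffers, which matches the comment that it fits with room to spare.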

kamranjon 8 hours ago | parent | prev | next [-]

I use LM Studio to host and run GLM 4.7 Flash as a coding agent. I use it with the Pi coding agent, but also with the Zed editor's agent integrations. I've used the Qwen models in the past, but have consistently come back to GLM 4.7 because of its capabilities.

I often use Qwen or Gemma models for their vision capabilities. For example, I'll often finish ML training runs, take a photo of the graphs and visualizations of the run metrics, and ask the model what I might look at tweaking to improve subsequent training runs. Qwen 3.5 0.8b is pretty awesome for really small and quick vision tasks like "Give me a JSON representation of the cards on this page".

Aurornis 7 hours ago | parent | prev | next [-]

It’s easy to get a combination of llama.cpp and a coding tool like OpenCode working for these. Asking an LLM for help setting it up can work well if you don’t want to find a guide yourself.

> and finding more value than just renting tokens from Anthropic or OpenAI?

Buying hardware to run these models is not cost effective. I do it for fun for small tasks but I have no illusions that I’m getting anything superior to hosted models. They can be useful for small tasks like codebase exploration or writing simple single use tools when you don’t want to consume more of your 5-hour token budget though.

toxik 5 hours ago | parent [-]

Oh lord, are the LLMs already replacing LLMs?

lkjdsklf 8 hours ago | parent | prev | next [-]

The people i know that use local models just end up with both.

The local models don’t really compete with the flagship labs for most tasks

But there are things you may not want to send to them for privacy reasons or tasks where you don’t want to use tokens from your plan with whichever lab. Things like openclaw use a ton of tokens and most of the time the local models are totally fine for it (assuming you find it useful which is a whole different discussion)

deaux 5 hours ago | parent [-]

The open weights models absolutely compete with flagship labs for most tasks. OpenAI and Anthropic's "cheap tier" models are completely uncompetitive with them for "quality / $" and it's not close. Google is the only one who has remained competitive in the <$5/1M output tier with Flash, and now has an incredibly strong release with Gemma 4.

Unless you have a corporate lock-in/compliance need, there has been no reason to use Haiku or GPT mini/nano/etc over open weights models for a long time now.

deaux 7 hours ago | parent | prev | next [-]

While they can be run locally, and most of the discussion on HN is about that, I bet that if you look at total tok/day, local usage is a tiny fraction of total cloud inference even for these models. Most people who do use them locally just run a prompt every now and then.

zozbot234 7 hours ago | parent [-]

This is why I'd like to see a lot more focus on batched inference with lower-end hardware. If you just do a tiny amount of tok/day and can wait for the answer to be computed overnight or so, you don't really need top-of-the-line hardware even for SOTA results.
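The pattern being described can be as simple as a file-backed queue: append prompts during the day, then drain them in one nightly batch run on the slow box. A sketch with the inference step itself left out (filename and format are arbitrary):

```python
import json
from pathlib import Path

QUEUE = Path("prompt_queue.jsonl")

def enqueue(prompt: str) -> None:
    """Append a prompt to the overnight queue (call this during the day)."""
    with QUEUE.open("a") as f:
        f.write(json.dumps({"prompt": prompt}) + "\n")

def drain() -> list:
    """Read and clear the queue; a nightly cron job feeds these to the model."""
    if not QUEUE.exists():
        return []
    prompts = [json.loads(line)["prompt"]
               for line in QUEUE.read_text().splitlines() if line.strip()]
    QUEUE.unlink()
    return prompts
```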

deaux 5 hours ago | parent [-]

> If you just do a tiny amount of tok/day and can wait for the answer to be computed overnight or so

But they can't? The usage pattern is the polar opposite. Most people running these models locally just ask them a few questions throughout the day. They want the answers now, or at least within a minute.

zozbot234 4 hours ago | parent [-]

If you want the answer right now, that alone ups your compute needs to the point where you're probably better off just using a free hosted-AI service. Unless the prompt is trivial enough that it can be answered quickly by a tiny local model.

bildung 8 hours ago | parent | prev | next [-]

The privacy/data security angle really is important in some regions and industries. Think European privacy laws or customers demanding NDAs. The value of Anthropic and OpenAI is zero in both cases, so they're easy to beat, despite local models being dumber and slower.

flux3125 8 hours ago | parent | prev | next [-]

They are okay for vibe coding throw-away projects without spending your Anthropic/OAI tokens.

Panda4 8 hours ago | parent | prev | next [-]

I was thinking the same thing. My only guess is that they are excited about local models because they can run them cheaper through OpenRouter?

kylehotchkiss 6 hours ago | parent | prev | next [-]

I am working on a research project to link churches from their IRS Exempt Organizations BMF entry to their Google search result, from the top 10 fetched. Qwen2.5-14b on a 16GB Mac mini. It works well enough!

It's entertaining to see HN increasingly treat a coding harness as the only value a model can provide.

dist-epoch 6 hours ago | parent | prev [-]

There are really nice GUIs for LLMs - CherryStudio, for example - which can be used with local or cloud models.

There are also web-UIs - just like the labs ones.

And you can connect coding agents like Codex, Copilot, or Pi to local models - they support OpenAI-compatible APIs.

It's literally one terminal command to start serving the model locally, and then you can connect various things to it, like Codex.
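For example, with llama.cpp's llama-server (the model file, port, and env-var convention below are placeholders; the client setup assumes an OpenAI-SDK-style agent):

```shell
# Serve a local GGUF model with an OpenAI-compatible API on port 8080.
llama-server -m ./glm-4.7-flash-Q4_K_M.gguf --port 8080

# Point an OpenAI-compatible coding agent at it.
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="local"   # any non-empty string; the local server ignores it
```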