horsawlarway 5 hours ago

For personal use, yes.

I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.

I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.

To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.

For my personal needs, free beats $100/m.

I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).

Some example projects

- Replacement launcher for android tvs (with usage monitoring and tracking for kids)

- Custom admin portals for my k8s cluster services

- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)

- Grocery list management and meal planning (mostly via openclaw)

- some custom workflows for 3d asset generation in comfyui.

---

Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.

rootlocus 4 hours ago | parent | next [-]

2x RTX3090 are around $4400. Without any electricity costs or other parts, that's 3.6 years of $100/m claude.
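As a rough sketch of that break-even arithmetic: the $4400 and $100/m figures are the ones quoted above, while the function name and the optional running-cost parameter are illustrative assumptions, not something from the comment.

```python
def breakeven_months(hardware_cost: float, monthly_fee: float,
                     monthly_running_cost: float = 0.0) -> float:
    """Months of subscription fees needed to equal the hardware price.

    monthly_running_cost is a hypothetical knob for electricity etc.;
    the comment above explicitly leaves it at zero.
    """
    net_saving = monthly_fee - monthly_running_cost
    if net_saving <= 0:
        raise ValueError("running costs eat the whole subscription saving")
    return hardware_cost / net_saving


months = breakeven_months(4400, 100)
print(f"{months:.0f} months, about {months / 12:.1f} years")
```

44 months is about 3.7 years, which the comment rounds down to 3.6. Any nonzero running cost pushes the break-even further out.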

overgard 3 hours ago | parent | next [-]

Assuming the $100/m claude subscription is still around in three years.

reddalo 2 hours ago | parent [-]

[dead]

freetonik 4 hours ago | parent | prev | next [-]

That's also years of top tier PC gaming, if you're into that.

augusto-moura 4 hours ago | parent [-]

2x RTX3090 is extremely overkill for gaming, you can run any released game on earth on ultra for much less

drnick1 3 hours ago | parent | next [-]

1x RTX3090 is absolutely not overkill for gaming however. Nowadays it's barely enough to get 60FPS in 4K in some recently released games. But the shocking part is that my 3090 is still probably worth as much as when I bought it about 4 years ago.

overgard 3 hours ago | parent | prev | next [-]

Having a second card doesn't really work well for gaming.

googletron 3 hours ago | parent | prev [-]

what?

kakacik 3 hours ago | parent [-]

AFAIK nvidia cards don't work in tandem (aka sli in the past) very well these days. So that ain't true.

Also, 2 gens old means bad performance at ray tracing, abysmal path tracing if at all. Pretty sure it can't run CP2077 smoothly in native 4k without dlss upscalers with all on ultra.

himata4113 3 hours ago | parent [-]

You can have the 2nd card as an offload for upscaling, frame generation and whatnot.

irishcoffee 2 hours ago | parent [-]

When I'm not running models I use the 2nd one in a pass-thru configuration to a windows vm for various things, usually gaming.

horsawlarway 4 hours ago | parent | prev | next [-]

Yes, today is not a great time to purchase hardware.

When I bought, I paid $850 a piece. And I needed one anyways for the gaming I was going to do.

My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.

---

I'll add to this: I personally don't like Apple hardware (not so much related to the hardware as their company philosophy), but their machines with unified memory (or AMD's latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entrypoint to local llms.

There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.

You can get a 48gb unified RAM M4 pro mac mini for ~2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend a year of claude now, get ~150tok/s for the next decade (plus) for ~free.

If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.

You'll spend less on power too.

The last thing I'd suggest at this point is buying an RTX3090. You can do a lot better for a lot cheaper.

tracker1 2 hours ago | parent [-]

If you're willing to go the AMD route, the AMD Radeon Pro R9700 definitely looks interesting for the price compared to NVidia.

felooboolooomba 2 minutes ago | parent [-]

Can we also run LLMs on Radeon?

jmuguy 3 hours ago | parent | prev | next [-]

Or a really excellent experience playing Satisfactory with the settings cranked up, which is priceless.

tripleee 3 hours ago | parent | prev | next [-]

Christ GPU prices have gotten crazy

How do AMD cards perform with LLMs? A 9070 is sold for ~$600 and has 16GB VRAM

overgard 3 hours ago | parent | next [-]

In my personal experience, I wouldn't bother with 16GB cards for coding -- the useful models are _slightly_ too large to work at any reasonable speed

lambda 3 hours ago | parent | prev [-]

That should do pretty well. Memory bandwidth is the biggest bottleneck for token generation, and at 644 GB/s a 9070 should handle that fine. Prompt processing is more compute-bound, and Nvidia tends to have the edge there.

16 GiB won't fit you much, so you'd probably want at least 2x, and preferably 3x of those, and then you need a motherboard, power, etc. that can handle that.
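To put rough numbers on what fits in 16 GiB, here is a minimal sketch. It assumes about 4.5 bits per weight for a Q4-class quant (an assumption; the real figure varies by quant and by tensor) and counts only the weights, ignoring KV cache and runtime overhead. The 26B and 35B sizes are read off the model names mentioned elsewhere in the thread.

```python
GIB = 1024 ** 3

# Assumed average for a Q4-class GGUF quant; not a measured value.
ASSUMED_BITS_PER_WEIGHT = 4.5


def weight_footprint_gib(params_billion: float,
                         bits_per_weight: float = ASSUMED_BITS_PER_WEIGHT) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def headroom_gib(params_billion: float, vram_gib: float) -> float:
    """VRAM left for KV cache and overhead; negative means weights don't fit."""
    return vram_gib - weight_footprint_gib(params_billion)


for params in (26, 35):
    size = weight_footprint_gib(params)
    left = headroom_gib(params, 16)
    print(f"{params}B: ~{size:.1f} GiB of weights, {left:+.1f} GiB headroom on a 16 GiB card")
```

Under these assumptions a 26B model leaves only a couple of GiB for context on one card, and a 35B model does not fit at all, which is consistent with the "slightly too large" experience reported above.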

tracker1 2 hours ago | parent [-]

You can get an R9700 with 32gb vram for ~$1200-1400 depending on where you live, which is probably a better option for AI use than 2x 9070(xt)

lambda 33 minutes ago | parent [-]

Yeah, definitely.

nyrikki 4 hours ago | parent | prev | next [-]

You can get 60tps with three 1080tis and the sparse model, and I bet two 16gb 5060tis would do the same for ~1200. One 3090 is enough for a useful system, even on an old am4 host.

flowerthoughts 3 hours ago | parent | prev | next [-]

In 3.6 years, chances are they are still worth $3k. Unless some new chip fab pops up that can spam the chip market. Even if the AI bubble bursts, I doubt we'll see high-RAM GPUs sell off.

sieabahlpark 4 hours ago | parent | prev [-]

[dead]

kpw94 4 hours ago | parent | prev | next [-]

> gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models

Since you're running quantized (at UD-Q4_K_XL), check out the "qat" models (unsloth/gemma-4-26B-A4B-it-qat-GGUF)!

- https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF (With "Jun 9 Update: Added MTP support.")

- https://blog.google/innovation-and-ai/technology/developers-...

SubiculumCode 31 minutes ago | parent | next [-]

How are the QAT models at coding? I looked for opinions since the release and haven't found much.

me_bx 2 hours ago | parent | prev [-]

TIL:

> Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model

twothreeone 4 hours ago | parent | prev | next [-]

> unsloth/Qwen3.6-35B-A3B-MTP-GGUF

I've actually tried this exact same model locally as well.. albeit on just a single 3090 at 128k context and I got around 40-60tok/s with Q4_K quantization.

The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.

It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue.. so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months' time.

Oh and I used "aider" to replace claude CLI, which may also be sub-optimal.. I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.

horsawlarway 3 hours ago | parent | next [-]

I don't generally switch to implementing things myself with this model, although there are definitely times where I stop it and correct it mid-task.

It's prone to thinking longer and more repetitively, again - it's definitely not opus 4.7/4.8.

I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).

I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".

I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like opus at work, though. I have opinions about lots of things, and reinforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll" if you will.

I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.

Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.

unethical_ban 3 hours ago | parent | prev [-]

I'm so out of the loop on this stuff, it's the first time in my IT career I feel really behind on things.

I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.

I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?

gonzalohm 4 hours ago | parent | prev | next [-]

Did you double the tokens per second by adding a second GPU or was the increase significantly less?

horsawlarway 4 hours ago | parent | next [-]

No real change in inference speed. It basically just allows me to slot in more context or a bigger model.

A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.

Sometimes that matters, a lot of times it doesn't.

On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.

I get very similar quality from both the Gemma-4 31B dense model and the Gemma-4 26B MOE model (both at Q4 quant), but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).
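One way to see why MOE helps so much is a bandwidth-only ceiling on decode speed: each generated token has to read the active weights once, so tok/s cannot exceed memory bandwidth divided by active bytes per token. This is a sketch under stated assumptions, not a benchmark: 936 GB/s is Nvidia's published RTX 3090 memory bandwidth, "A4B" is read as roughly 4B active parameters per token, 4.5 bits per weight is an assumed Q4-class average, and KV cache reads, compute, and multi-GPU overhead are ignored.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float = 4.5) -> float:
    """Upper bound on tokens/s if decode were limited only by weight reads."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


RTX_3090_GB_S = 936  # published spec; a single card's bandwidth

dense = decode_ceiling_tok_s(RTX_3090_GB_S, 31)  # 31B dense: all weights active
moe = decode_ceiling_tok_s(RTX_3090_GB_S, 4)     # 26B MOE: ~4B active (assumed)

print(f"dense ceiling ~{dense:.0f} tok/s (reported: 46)")
print(f"MOE ceiling   ~{moe:.0f} tok/s (reported: 150)")
print(f"ceiling ratio {moe / dense:.2f}x, reported ratio {150 / 46:.2f}x")
```

The dense figure lands close to the reported 46tok/s, while the MOE model sits well below its ceiling, which suggests that costs other than weight reads start to dominate once the per-token weight traffic gets this small.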

mirekrusin 4 hours ago | parent | prev [-]

You’re adding extra gpu for more vram, not speed.

anhtqweb 2 hours ago | parent | prev | next [-]

Grocery list management and meal planning sounds interesting. Would you mind sharing a little bit more on your use case please?

agup792 4 hours ago | parent | prev [-]

That sounds amazing. If I had some GPUs sitting around, I would totally do it. Sounds expensive to do it otherwise though.