fourside 8 hours ago

Maybe for folks who are deep into this, but it’s not exactly accessible. I tried reading up on it a couple of months ago, but working out what hardware I needed, which model and how to configure it (model size vs. quantization), and how I’d get access to the hardware (which, for decent results in coding, runs $4k–$10k new last I checked) added up to a non-trivial barrier to entry. I was trying to do this over a long weekend and ran out of time. I’ll have to look into it again, because having the local option would be great.

Edit: the replies to my comment are great examples of what I’m talking about when I say it’s hard to determine what hardware I’d need :).

imetatroll 3 hours ago | parent | next [-]

For me the big hangup is the hardware. If I could find a simple guide to putting together a machine that I can run off an outlet in my home, I am sold. The problem is that I haven't found this yet (though I suppose I haven't looked very hard either).

jonaustin 7 hours ago | parent | prev | next [-]

Just get a decent MacBook, use LM Studio or OMLX and the latest Qwen model you can fit in unified RAM.

Hooking up Claude Code to it is trivial with omlx.

https://github.com/jundot/omlx
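For what a hookup like this might look like: Claude Code honors the `ANTHROPIC_BASE_URL` environment variable, so if the local server exposes an Anthropic-compatible endpoint you can redirect Claude Code at it. The port, and whether omlx actually speaks the Anthropic API, are assumptions here, not something I've verified:

```shell
# Hypothetical sketch: point Claude Code at a local server.
# ANTHROPIC_BASE_URL is a documented Claude Code setting; the port
# and the Anthropic-compatible endpoint are assumed, not confirmed.
export ANTHROPIC_BASE_URL="http://localhost:8080"
claude   # requests now go to the local server instead of the Anthropic API
```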

root_axis 8 hours ago | parent | prev [-]

> new hardware runs $4k-$10k last I checked

Starting closer to $40k if you want something practical. $10k can't run anything worthwhile for SDLC at useful speeds.

zozbot234 7 hours ago | parent [-]

$10K should be enough to pay for a 512GB-RAM machine, which, combined with partial SSD offload for the remaining memory requirements, should be able to run SOTA models like DS4-Pro or Kimi 2.6 at workable speed. It depends on whether MoE weights have enough locality over time that the SSD-offload portion is ultimately a minor factor.

(If you are willing to let the machine work mostly overnight/unattended, with only incidental and sporadic human intervention, you could even decrease that memory requirement a bit.)

SwellJoe 7 hours ago | parent [-]

You can't put "SSD offload" and "workable speed" in the same sentence.

zozbot234 6 hours ago | parent [-]

As a typical example, DeepSeek v4-pro has 59B active params at mostly FP4 precision, so it needs to "find" around 30GB worth of parameters in RAM per inferred token. On a 512GB machine, most of those parameters will already be cached in RAM (model size on disk is around 862GB), so even assuming, for the sake of argument, that MoE expert selection is completely random and unpredictable, only around 15GB in total has to be fetched from storage per token. If MoE selection is not completely random and there's enough locality, that figure improves quite a bit and inference becomes quite workable.
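The arithmetic above can be sketched out. All figures are taken from the comment (59B active params at FP4, ~862GB on disk, 512GB RAM); the naive random-selection math gives closer to 12GB/token than 15GB, the gap presumably being RAM that isn't fully available as a weight cache. The 90% locality hit rate at the end is a purely hypothetical illustration:

```python
GB = 1e9

active_params = 59e9
bytes_per_param = 0.5              # FP4 = 4 bits per parameter
total_model_gb = 862               # model size on disk
ram_gb = 512                       # total RAM used as a weight cache

# Weights that must be "found" per inferred token: ~29.5 GB.
active_gb = active_params * bytes_per_param / GB

# If MoE expert selection were completely random, the chance a needed
# weight is already cached in RAM is roughly ram / model_size.
hit_rate_random = ram_gb / total_model_gb            # ~0.59
ssd_fetch_random = active_gb * (1 - hit_rate_random) # ~12 GB/token

# With temporal locality in expert selection, the effective hit rate
# rises and SSD traffic shrinks (hypothetical 90% hit rate):
ssd_fetch_local = active_gb * (1 - 0.90)             # ~3 GB/token

print(f"active weights/token:       {active_gb:.1f} GB")
print(f"SSD fetch, random experts:  {ssd_fetch_random:.1f} GB/token")
print(f"SSD fetch, 90% locality:    {ssd_fetch_local:.1f} GB/token")
```

This is why the locality question is the crux: at NVMe read speeds, the difference between ~12GB and ~3GB fetched per token is the difference between painful and workable.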