It seems like the Framework Desktop has become one of the best choices on the market for local AI. For a bit over $2,000 you can get a machine with, if I understand correctly, around 120 GiB of accessible VRAM, plus the seemingly brutal Radeon 8060S, an iGPU whose performance appears to be challenged only by a fully loaded Apple M4 Max or, of course, a sufficiently big dGPU. The previous best option seemed to be Apple, but for a similar amount of VRAM I can't find a comparably good deal. (The last time I could find an Apple Silicon device that sold for ~$2,000 with that much RAM on eBay, it was an M1 Ultra.) I am not really dying to run local AI workloads, but the prospect of being able to play with larger models is tempting. It's not $2,000 tempting, but tempting.
| |
Aurornis | 3 days ago:

FYI, there are a number of Strix Halo boards and computers on the market already. The Framework version looks to be high quality and well supported, but it's not the only option in this space.

Also, take a good hard look at the token output speeds before investing. If you're expecting quality, context windows, and output speeds similar to the hosted providers, you're probably going to be disappointed. There are a lot of tradeoffs with a local machine.
jchw | 3 days ago:

> Also take a good hard look at the token output speeds before investing. If you're expecting quality, context windows, and output speeds similar to the hosted providers you're probably going to be disappointed. There are a lot of tradeoffs with a local machine.

I don't really expect to see performance on par with the SOTA hosted models, but I'm mainly curious what you could do with local models that wouldn't otherwise be doable with hosted models (or at least, things you wouldn't want to do with them for other reasons, like privacy).

One thing I've realized lately is that Gemini, and even Gemma, are really, really good at transcribing images: much better and more versatile than OCR models, since they can also describe the images. With the realization that Gemma, a model you can self-host, is good enough to be useful, I've been tempted to play around with doing this sort of task locally. But again, $2,000 tempted? Not really. I'd need to find other good uses for the machine than just dicking around.

In theory, Gemma 3 27B BF16 would fit very easily in system RAM on my primary desktop workstation, but I haven't given it a go to see how slow it is. I think you mainly end up memory-bandwidth constrained on these CPUs, but I wouldn't be surprised if the full BF16 or a relatively light quantization gives tolerable t/s.

Then again, right now AI Studio gives you better t/s than you could hope to get locally, with a generous amount of free usage. So maybe it would make sense to wait until the free lunch ends, but I don't want to build anything interesting that relies on the cloud, because I dislike the privacy implications, even though everything I'm interested in doing is fully within the ToS.
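(As a minimal sketch of the local transcription task jchw describes: the snippet below sends an image to a locally hosted multimodal Gemma via Ollama's HTTP API. The model tag, prompt, and endpoint are illustrative assumptions, not anything specified in the thread.)

    # Sketch: ask a locally hosted vision model (via Ollama) to transcribe an image.
    # Assumes an Ollama server on localhost:11434 with a multimodal Gemma 3 model pulled.
    import base64
    import json
    import urllib.request

    def transcribe_image(path: str, model: str = "gemma3:27b") -> str:
        """Send an image to the local model and return its transcription/description."""
        with open(path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("ascii")

        payload = json.dumps({
            "model": model,
            "prompt": "Transcribe any text in this image, then briefly describe it.",
            "images": [image_b64],  # Ollama accepts base64-encoded images for vision models
            "stream": False,
        }).encode("utf-8")

        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    if __name__ == "__main__":
        print(transcribe_image("scanned_page.png"))

(On the memory-bandwidth point: as a rough upper bound, decode speed is about memory bandwidth divided by the bytes read per token, so ~54 GB of BF16 weights over roughly 80 GB/s of dual-channel DDR5 caps out around 1-1.5 tokens/s, while a ~16 GB 4-bit quant would land closer to 5 tokens/s.)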
bytefactory | 2 days ago:

I had long been of the opinion that local models were a long way away from being useful, and that they were toys at best. I'm a heavy user of o3/GPT-5, Claude Opus/Sonnet and Gemini 2.5 Pro, so my expectations were sky high.

I tried out Gemma 27B on LM Studio a few days ago, and I was completely blown away! It has a warmth and character (and smarts!) that I was not expecting in a tiny model. It just doesn't have tool use (although there are hacky workarounds), which would have made it even better. Qwen 3 with 30B parameters (3B active) seems to be nearly as capable, but also supports tool use.

I'm currently in the process of vibe coding an agent network with LangGraph orchestration and Gemma 27B/Qwen 3 30B-A3B, with memory, context management and tool management. The Qwen model even uses a tiny 1.7B "draft" model for speculative decoding, which improves performance. On my 7800X3D and RTX 4090 with 64 GB RAM, I get latencies of ~200-400 ms and 20-30 tokens/s, which is plenty fast.

My thought process is that this local stack will let me use agents to their fullest in administering my machine. I always felt uneasy letting Claude Code, Gemini CLI or Codex operate outside my code folders, yet their utility in helping me troubleshoot problems (I'm a recent Linux convert) was too attractive to ignore. Now I have the best of both worlds: privacy, and AI models helping with sysadmin. They're also great for quick "what options does kopia backup use?" type questions, for which I've assigned a global hotkeyed helper.

Additionally, if one has a NAS with the *arr stack for downloading, say, perfectly legal Linux ISOs, such a private model would be far more suitable.

It's early days, but I'm excited about other use cases I might discover over time. It's a good time to be an AI enthusiast.
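(A minimal sketch of the kind of agent bytefactory describes: LangGraph's prebuilt ReAct agent calling a local model served over an OpenAI-compatible API, e.g. LM Studio on localhost:1234. The endpoint, model tag, and the single shell tool are assumptions for illustration; the actual stack with memory management and draft-model decoding is more involved.)

    # Sketch: a LangGraph ReAct agent backed by a locally served model.
    # Assumes LM Studio (or any OpenAI-compatible server) on localhost:1234
    # serving something like Qwen 3 30B-A3B; names below are illustrative.
    import subprocess

    from langchain_core.tools import tool
    from langchain_openai import ChatOpenAI
    from langgraph.prebuilt import create_react_agent

    @tool
    def run_shell(command: str) -> str:
        """Run a shell command and return its output (use with care for sysadmin tasks)."""
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=30)
        return result.stdout or result.stderr

    # Local endpoint; the API key is not actually checked by a local server.
    llm = ChatOpenAI(
        base_url="http://localhost:1234/v1",
        api_key="not-needed",
        model="qwen3-30b-a3b",
    )

    agent = create_react_agent(llm, tools=[run_shell])

    if __name__ == "__main__":
        result = agent.invoke(
            {"messages": [{"role": "user",
                           "content": "What options does kopia backup use?"}]}
        )
        print(result["messages"][-1].content)

(Speculative decoding with the small draft model would typically happen inside the serving layer, so nothing changes on the agent side.)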
| |
walterbell | 3 days ago:

The HP Z2 Mini G1a with 128GB and Strix Halo is ~$5K: https://www.notebookcheck.net/Z2-Mini-G1a-HP-reveals-compara...
| |
layer8 | 3 days ago:

There are a dozen or more (mostly Chinese) manufacturers coming out with mini PCs based on that Ryzen AI Max+ 395 platform, for example the Bosgame M5 AI Mini at just $1,699 with 128GB. Just pointing out that this configuration is not a Framework exclusive.
jchw | 3 days ago:

That's true. Going for pure compute value, it does seem you can do even better. Last I looked at Strix Halo products, everything else I could find was laptop announcements, and laptops are generally going to be even more expensive.