jchw | 3 days ago
> Also take a good hard look at the token output speeds before investing. If you're expecting quality, context windows, and output speeds similar to the hosted providers you're probably going to be disappointed. There are a lot of tradeoffs with a local machine.

I don't really expect to see performance on par with the SOTA hosted models, but I'm mainly curious what you could do with local models that wouldn't otherwise be doable with hosted models (or at least, things you wouldn't want to do with them for other reasons, like privacy).

One thing I've realized lately is that Gemini, and even Gemma, are really, really good at transcribing images; they're much better and more versatile than OCR models, since they can also describe the images. With the realization that Gemma, a model you can self-host, is good enough to be useful, I've been tempted to play around with doing this sort of task locally. But $2,000 tempted? Not really. I'd need to find other good uses for the machine than just dicking around.

In theory, Gemma 3 27B at BF16 would fit very easily in system RAM on my primary desktop workstation, but I haven't given it a go to see how slow it is. I think you mainly get memory-bandwidth constrained on these CPUs, but I wouldn't be surprised if the full BF16 or a relatively light quantization gives tolerable t/s.

Then again, right now AI Studio gives you better t/s than you could hope to get locally, with a generous amount of free usage. So maybe it would make sense to wait until the free lunch ends, but I don't want to build anything interesting that relies on the cloud, because I dislike the privacy implications, even though everything I'm interested in doing is fully within the ToS.
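For reference, this kind of local transcription is only a few lines against something like Ollama. A minimal sketch; the gemma3:27b model tag, the default endpoint, and the prompt are just assumptions about one possible setup, not a recommendation:

```python
# Minimal sketch: transcribe/describe an image with a locally hosted Gemma 3 27B.
# Assumes Ollama is running on its default port with the gemma3:27b model pulled;
# model name, endpoint, and prompt are placeholders -- adjust for your own setup.
import base64
import requests

def transcribe_image(path: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "gemma3:27b",
            "stream": False,
            "messages": [{
                "role": "user",
                "content": "Transcribe all text in this image, then briefly describe it.",
                "images": [image_b64],
            }],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

print(transcribe_image("scan.png"))
```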
bytefactory | 2 days ago
I had long been of the opinion that local models were a long way from being useful, and that they were toys at best. I'm a heavy user of o3/GPT-5, Claude Opus/Sonnet and Gemini 2.5 Pro, so my expectations were sky high.

I tried out Gemma 27B on LM Studio a few days ago, and I was completely blown away! It has a warmth and character (and smarts!) that I was not expecting in a tiny model. It just doesn't have tool use (although there are hacky workarounds), which would have made it even better. Qwen 3 with 30B parameters (3B active) seems to be nearly as capable, and it does support tool use.

I'm currently vibe coding an agent network with LangGraph orchestration, using Gemma 27B/Qwen 3 30B-A3B with memory, context management and tool management. The Qwen model even uses a tiny 1.7B "draft" model for speculative decoding, which improves performance. On my 7800X3D with an RTX 4090 and 64GB RAM, I get latency of ~200-400ms and 20-30 tokens/s, which is plenty fast.

My thought process is that this local stack will let me use agents to their fullest in administering my machine. I always felt uneasy letting Claude Code, Gemini CLI or Codex operate outside my code folders, yet their utility in helping me troubleshoot problems (I'm a recent Linux convert) was too attractive to ignore. Now I have the best of both worlds: privacy, and AI models helping with sysadmin. They're also great for quick "what options does kopia backup use?" type questions, for which I've set up a globally hotkeyed helper. A rough sketch of the agent setup is below.

Additionally, if one has a NAS with the *arr stack for downloading, say, perfectly legal Linux ISOs, such a private model is far more suitable.

It's early days, but I'm excited about other use cases I might discover over time! It's a good time to be an AI enthusiast.
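The core of the agent loop is short. This is a minimal sketch, not my exact setup: the model identifier, LM Studio's default port, and the run_shell example tool are all placeholders you'd swap for your own.

```python
# Minimal sketch: a ReAct-style agent via LangGraph talking to a local model
# served by LM Studio's OpenAI-compatible endpoint. Model name, port, and the
# example tool are assumptions -- adjust for your own setup.
import subprocess

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def run_shell(command: str) -> str:
    """Run a shell command and return its output (illustrative sysadmin tool)."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=30)
    return result.stdout or result.stderr

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's local server (assumed port)
    api_key="not-needed",                 # local servers typically ignore the key
    model="qwen3-30b-a3b",                # assumed model identifier
    temperature=0.2,
)

agent = create_react_agent(llm, tools=[run_shell])

result = agent.invoke(
    {"messages": [("user", "What options does kopia backup support?")]}
)
print(result["messages"][-1].content)
```

In practice you'd want to gate anything like run_shell behind a confirmation step before letting a local model execute commands on your machine.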