Yeah that 60-150b~ range is such a sweet spot for current 'prosumer' hardware, I'd love to see something like a 120b-a14b or thereabouts.

▲

tarruda a month ago | parent | next [-]

I have a 128G mac studio and even 397B was a happy surprise to me due to its high quantization resilience.

I've created a 2.54 BPW quant that fits on my hardware with 128k context, 20 tps tg and 200 tps pp, while maintaining high scores on many benchmarks: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/discus...
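
For anyone wanting to sanity-check that size claim, here is a back-of-the-envelope sketch. It assumes the weights take roughly parameter count times average bits per weight, and it ignores GGUF metadata, mixed-precision tensors, and the KV cache needed for 128k context. `quant_size_bytes` is a name made up for this illustration, not part of any tool.

```python
def quant_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Rough weight footprint: parameters times average bits per weight, in bytes."""
    if n_params <= 0 or bits_per_weight <= 0:
        raise ValueError("n_params and bits_per_weight must be positive")
    return n_params * bits_per_weight / 8


# Figures from the comment above: a nominal 397B parameters at 2.54 BPW.
size = quant_size_bytes(397e9, 2.54)

# Report both decimal GB and binary GiB, since "GB" is used loosely for both.
print(f"{size / 1e9:.1f} GB (decimal), {size / 2**30:.1f} GiB (binary)")
```

By this estimate the weights alone land a little under the 128G of the machine in decimal units, and noticeably lower when counted in GiB, so which side of a given "GB" threshold the file falls on depends on the unit being used.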

▲

chrisweekly a month ago | parent | next [-]

Apple store's current options for mac studio seem to max out at 96GB. I'm questioning ROI, esp. given it's not upgradeable. Curious about others' takes on new mac hardware.

▲

tarruda a month ago | parent | next [-]

> I'm questioning ROI

If by ROI you mean saving more money than using paid APIs, then I don't think it is worth it. All you gain is full sovereignty over your AI usage.

▲

hadlock a month ago | parent | prev | next [-]

Rumor mill has been buzzing about m5 mini and studio. If anything materializes close to what the rumor mill has been suggesting, the m5 could be appealing to home lab/local LLM folks, or at least help inform if the M6 will be worthwhile. Assuming Apple was able to lock in halfway reasonable memory prices early enough in advance.

▲

drob518 a month ago | parent | prev | next [-]

Currently, Apple is letting some of its models go out of stock in preparation for new models coming in a few weeks. I would expect at least 128 GB models at that time. That said, the memory crunch is hitting everyone.

▲

the_lucifer a month ago | parent [-]

Yep, even with their supply chain prowess, they're being hit now that some of their longer-term memory contracts are nearing renewal.

	▲	drob518 a month ago | parent [-]
		Yep. Something needs to break soon. Or rather, something WILL break soon, one way or another. Was talking to a friend last night who works planning infrastructure rollout and he said costs for equipment have roughly doubled in the last six months. Soon, these projects aren’t going to be viable.

▲

ramses0 a month ago | parent | prev [-]

I'd held off from buying a new personal laptop for quite a few years and felt that the M5-128gb was justifiable once I started really seeing payoffs from using AI at work.

Running w/ Cursor and doing some "nights and weekends" type coding / conversations, I was hitting $100-200 of usage within a few weeks. I know there's probably better ways to manage costs, but I was getting enough value out of it to keep bumping my spend limit from $20 => $40 => $80 => $120 (and then I stopped spending! :-)

Messing around with local-llm, I've settled on `omlx` and `gemma` for "conversational", and I think it's `qwen-120b-a3b-6bit` or something for the "heavy hitter". Gemma "gets it" a lot more, whereas that particular `qwen` tends to fall into the "MuSt WrItE CoOooDeee!" behaviour in a lot of cases instead of holding a conversation, and does an awesome job of randomly spitting out ascii-art diagrams or including full-blown bash shell scripts to illustrate different cases.

My POV is: "Local for slightly slower/casual usage", the ~1% of battery usage per minute of LLM is shockingly accurate (eg: 30 minutes == 30% drop!). "Gemma for discussion and emitting DESIGN-... docs", and "Qwen for converting DESIGN-... to PLAN-...", (as well as implementation, but generally from a fresh context loading the relevant PLAN-... or supporting docs)

...then supplement that with direct Cursor usage in case I screw up some setting on being able to get the local LLM working, or if I need to include literal web-research or really need access to some SOTA model. Using the pi-coder harness locally, web pages are kind of a difficult conundrum as they can be kind of gigantic and are really worthy of special casing, some sort of sub-harness, etc... but the more "stuff" you put into the agent, the less context window (and memory!) you have available, so it's a real balancing act.

The other biggest problem is that you're limited (locally) to ~20-80tps and in some cases you have to chew on or "swallow" the whole prompt up to that point if you end up with some sort of cache miss (TTFT). The `omlx` server does a pretty good job (after you tweak some settings and stuff) of allowing MANY prompt continuations to nearly immediately start generating tokens, but sometimes if I have two agents going (eg: Gemma talking shit about Qwen's output or vice versa) in a longer context window, then you'll take that hit.
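
To put a rough number on that cache-miss hit, here is a minimal sketch. It assumes time to first token is dominated by prompt processing at a constant rate, which is a simplification. The 200 tokens/s figure is the prompt-processing speed quoted upthread for a different machine, and the 60k-token context is purely illustrative. `prefill_seconds` is a hypothetical helper, not an `omlx` API.

```python
def prefill_seconds(prompt_tokens: int, pp_tokens_per_s: float, cached_tokens: int = 0) -> float:
    """Seconds spent processing the uncached part of the prompt before the first new token."""
    if pp_tokens_per_s <= 0:
        raise ValueError("pp_tokens_per_s must be positive")
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / pp_tokens_per_s


# Illustrative numbers only: a 60k-token conversation at 200 tokens/s prompt processing.
full_miss = prefill_seconds(60_000, 200.0)           # whole prompt has to be re-read
warm_hit = prefill_seconds(60_000, 200.0, 59_500)    # only the newest 500 tokens are uncached

print(f"cache miss: {full_miss:.0f} s, cache hit: {warm_hit:.1f} s")
```

Under these assumptions a full miss costs minutes while a warm continuation costs a couple of seconds, which is why two agents evicting each other's cache in a long context hurts so much.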

"Other people's compute" is definitely more freeing, but even looking at $200/mo usage that's $2400 vs. the ~$6k for a maxed out MBP. Call it $2500 vs. $7500 and you'd say that "local AI gives you a 3-year amortization window for a slower, worse experience" ... but if you're strategic about your usage, the ability to "talk for free" and occasionally "burst" to an online provider or having some hugging-face tokens to try out different models that you can't quite run locally is really nice. Talking to the AI (locally) to even just do non-coding planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!

In some ways, seeing the "advantage" of having the local 128gb capacity for LLM, I'm semi-wishing I'd have gotten a mac mini instead, but then I can't quite do the 100% offline stuff (eg: coffee-shop) that the maxed out laptop allows.

If it were a mini running locally, I'd feel more comfortable calling it the always-on "AI brain" to process my emails, run crontab summaries, whatever kind of "open-claw-ish" stuff that you could do w/o relying on having to "keep the laptop lid open all the time". I'm sure there's ways to repurpose things, but longer-term, call it even 3-5 years from now... any sort of 128gb machine will be more than capable where you'd want to have one "doing stuff" locally within your home network (IMHO).

▲

chrisweekly a month ago | parent [-]

Thank you! That was a generous and helpful response, I really appreciate it. Food for thought...

>"...if you're strategic about your usage, the ability to "talk for free" and occasionally "burst" to an online provider or having some hugging-face tokens to try out different models that you can't quite run locally is really nice. Talking to the AI (locally) to even just do non-coding planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!"

^ this resonates, loudly.

	▲	ramses0 25 days ago | parent [-]
		Thanks, kind stranger! I wrote the comment that I would have loved to find before (and after) making the leap. Stuff is changing so fast, and there's at least three tracks: "Scavenger-old-linux-box", "Fancy-AI-cube", "Mac + $$$ + RAM"

		Again: I'm finding waaaay enough utility that I'm tempted to invest more "CapEx" and get a used system for day-to-day, "always on" local work... but more literally, that's probably a better job for "OpEx"! Tune my "crontab" work against local models and then max out at a $1/day budget slaved to an always on RPI connected to ethernet at home. $365/year of off-site AI lasts 10 years before I come close to recouping the hardware (and electricity) costs of having "yet another device" purchased and turned on 24x7... and certainly there will come a day when you go to the store and buy a $200-500 "TITO" device (Tokens In => Tokens Out) that plugs into a ~30-60W USB-C port before then.

		If you're using HF tokens (or "rent-a-A100" or whatever), are always connected to home ethernet (Sun Microsystems: The Network IS the Computer), and maybe supplement with a Kagi backend for attaching to the raw internet then you get _most_ of the surety of "my queries are private" unless you're locally hacked or are the target of nation-state scrutiny. :shrug:?

		Keep in touch if you end up doing something cool with all this! $USERNAME@yahoo.com (and hopefully I'll have my AI setup filtering out all the viagra spam before then!).
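
The cost arithmetic in this sub-thread can be sketched in a few lines. This is a simplification under stated assumptions: it ignores electricity, resale value, and any residual "burst" spend on hosted models, and `breakeven_months` is a name invented for the example. The dollar figures are the ones quoted in the comments above.

```python
def breakeven_months(hardware_cost: float, monthly_spend: float) -> float:
    """Months of paid usage that add up to the hardware price."""
    if monthly_spend <= 0:
        raise ValueError("monthly_spend must be positive")
    return hardware_cost / monthly_spend


# Quoted upthread: a ~$6k laptop against $200/mo of usage.
print(f"{breakeven_months(6000, 200):.0f} months")

# The rounded version from the same comment: $7500 of hardware against $2500/yr.
print(f"{breakeven_months(7500, 2500 / 12):.0f} months")

# The $1/day hosted budget floated above, totalled over 10 years.
print(f"${365 * 10} of hosted usage over 10 years")
```

The second case is the "3-year amortization window" mentioned earlier, and the last line shows the hardware price at which a $1/day budget would take a decade to catch up.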

▲

smcleod a month ago | parent | prev | next [-]

That's impressive getting a 397B down to <110GB~. HF link is broken though!

	▲	tarruda 25 days ago | parent [-]
		> That's impressive getting a 397B down to <110GB

		It is higher than 110GB. MacOS allows up to 125G of the RAM to be shared with GPU, so it is certainly less than that!

		> HF link is broken though!

		Doesn't seem broken to me, but you should be able to search for tarruda/Qwen3.5-397B-A17B-GGUF on huggingface.

▲

ttoinou a month ago | parent | prev [-]

better than antirez ds4 ?

	▲	tarruda a month ago | parent [-]
		I only tried a very early version of that when it was just a llama.cpp fork and Qwen was certainly better in my tests. But I was not super impressed with deepseek 4 flash using it from the official API either, so it doesn't seem to be the quantization's fault. It is a good model, but nothing out of the ordinary in the few benchmarks I ran on it (with full awareness that benchmarks are biased).

▲

gcr a month ago | parent | prev | next [-]

What’s the price point for getting into that sweet spot?

I’m on an M1 Max with 32GB VRAM, so I’m looking forward to the 27B or 35B-A3B models. Is dropping $5k for an RTX 6000 or a DGX Spark really the best option?

▲

anonym29 a month ago | parent | next [-]

Strix Halo at $2k with similar TG and about half the PP of DGX Spark was a pretty good deal IMO, especially considering it's also a full x86 system... 16c/32t Zen 5, 40 CU RDNA 3.5, 128 GB unified memory at ~220 GB/s real-world speeds (256 GB/s theoretical) - that runs full tilt at 140W in performance mode and idles at ~10W.

Unfortunately, the prices rose on these a lot, but unevenly. Beelink GTR 9 Pro is $4400, Framework Desktop is ~$3500, for what is basically the exact same mainboard as a Bosgame M5 for $2800.

Apple's M5 Max is another attractive option. Apple silicon traditionally had great MBW and was good at TG, but struggled with PP, but the new neural engines in those GPU cores have made a big difference in a good way here.

Gorgon Halo is rumored for June announcement with Q4'26 release with basically +100 MHz clocks on Strix Halo, LPDDR5X-8533 instead of LPDDR5X-8000, but more importantly, 192 GB max instead of 128 GB.

I'd say it's better to wait for Gorgon Halo than to grab Strix Halo now. However, Medusa Halo, rumored for H2'27, is slated to have up to 26c Zen 6 (heterogeneous cores - kinda funny that AMD is heading towards these as Intel retreats from them), 48 CU of RDNA 5 instead of 40 CU RDNA 3.5, and a 384 bit bus w/ LPDDR6, which should make 256 GB at more like ~490-600 GB/s MBW, which will really make Strix and Gorgon Halo obsolete.
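
The bandwidth figures in this comment follow from transfer rate times bus width, which can be sketched as below. The 256-bit bus width for Strix Halo and Gorgon Halo is an assumption on my part (it is the width that reproduces the quoted 256 GB/s theoretical figure at LPDDR5X-8000), and the rumored parts may of course ship differently. Both function names are made up for this illustration.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak in GB/s: transfers per second times bytes moved per transfer."""
    return transfer_rate_mt_s * 1e6 * (bus_width_bits / 8) / 1e9


def implied_rate_mt_s(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Inverse of the above: the transfer rate a given bandwidth would require."""
    return bandwidth_gb_s * 1e9 / (bus_width_bits / 8) / 1e6


# LPDDR5X-8000 vs LPDDR5X-8533, both on an assumed 256-bit bus.
print(f"Strix Halo:  {peak_bandwidth_gb_s(8000, 256):.0f} GB/s")
print(f"Gorgon Halo: {peak_bandwidth_gb_s(8533, 256):.0f} GB/s")

# What the rumored 490-600 GB/s on a 384-bit bus would imply for the memory speed.
low, high = implied_rate_mt_s(490, 384), implied_rate_mt_s(600, 384)
print(f"Medusa Halo: roughly {low:.0f}-{high:.0f} MT/s")
```

If the assumption holds, the LPDDR5X-8533 bump is a single-digit percentage gain, while the wider bus is where most of the rumored Medusa Halo improvement would come from.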

Also worth keeping an eye out for Serpent Lake (intel CPU + nvidia iGPU on a single board with unified memory, rumored for 2028-2029 iirc), and on the 160 GB Crescent Island Intel dGPU.

▲

tempoponet a month ago | parent | prev | next [-]

Expect to pay $4k-10k

- Your RTX 6000 is closer to $10k now

- Sparks are creeping into the $4-5k range

- AMD Strix are ~3.5k

- Apple depends on chipset and memory. Sweet spot would be 128gb M3 Ultra, probably $6-8k but admittedly haven't been tracking closely. New M5 might come in the fall. You can get a new 128gb M5 Max laptop for ~5-6k today.

- a 4x3090 rig would take $5-6k

Every platform has tradeoffs, but it's mostly ecosystem, memory bandwidth, and power consumption. They're all slow. The best option is likely to rent hardware on Runpod. The ROI on self-hosting is very low unless you have a specific need or you're ok treating it as a hobby.

▲

anonym29 a month ago | parent | next [-]

Bosgame M5 (Strix Halo) w/ 128 GB still goes for $2800 right now. SH systems have surged in price dramatically but quite unevenly.

>The best option is likely to rent hardware on Runpod.

Vast.ai is much cheaper, but the broader point here is contestable. The only dimension in which cloud GPU rentals win is cost. You lose the confidentiality, integrity, and availability benefits of local deployments.

	▲	ai_fry_ur_brain a month ago | parent [-]
		Rentals are priced to pay themselves off in 1-1.5 years (when renting them out per hour, not selling tokens). It's never a better option to rent. Not that I'd encourage anyone to throw large amounts of money to have access to LLMs, but you're definitely going to be better off buying something that you can amortize over multiple years with a multi-year warranty.
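
As a rough illustration of the payoff claim above, this sketch turns a purchase price and a payoff period into an implied hourly rental rate. It is only arithmetic under assumptions: the $10k price is the RTX 6000 figure mentioned elsewhere in the thread, utilization is a made-up parameter for how much of the time the card is actually rented out, and power, hosting, and margin are ignored. `implied_hourly_rate` is a hypothetical helper.

```python
HOURS_PER_YEAR = 24 * 365


def implied_hourly_rate(purchase_price: float, payoff_years: float, utilization: float = 1.0) -> float:
    """Hourly price at which rental income equals the purchase price after payoff_years."""
    if payoff_years <= 0 or not 0 < utilization <= 1:
        raise ValueError("payoff_years must be positive and utilization in (0, 1]")
    return purchase_price / (payoff_years * HOURS_PER_YEAR * utilization)


# A $10k card paying itself off in 1 to 1.5 years when rented around the clock.
for years in (1.0, 1.5):
    print(f"{years} yr payoff: ${implied_hourly_rate(10_000, years):.2f}/hour")

# Same card if it only finds a renter half of the time.
print(f"1.0 yr payoff at 50% utilization: ${implied_hourly_rate(10_000, 1.0, 0.5):.2f}/hour")
```

How this compares with buying depends on how many hours a given buyer would really keep the hardware busy, which the sketch leaves as an input rather than answering.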

▲

bahmboo a month ago | parent | prev | next [-]

$2600 gets MBP M5 Pro 48gb. 64gb requires a Max which bumps it to $4200 at which point you may as well spend the $800 to go to 128gb.

▲

ai_fry_ur_brain a month ago | parent | prev [-]

And for what? Spend 10-15k for the sloppiest of slop code, non-deterministic automations, and the ability to spawn an AI gf?

This whole thing is really starting to remind me of the crypto hype phases of 2016-2018 when everyone thought their investment in GPUs was going to make them rich.

▲

dvfjsdhgfv a month ago | parent | next [-]

I upvoted your comment even though I disagree with you.

Yes, LLMs are sloppy, and local models usually more so (but things change fast).

But the local ones have one big advantage: they are private. So you can safely feed them the collection of your private documents and things you wouldn't trust people like sama with. The fact that some people do not care is one of the failures of our educational system.

▲

organsnyder a month ago | parent | prev | next [-]

It is possible to get real work done with LLMs. There are plenty of ethical concerns, and they're definitely over-hyped, but they are exceptionally useful tools when used well.

	▲	varispeed a month ago | parent [-]
		[dead]

▲

a month ago | parent | prev | next [-]

[deleted]

▲

gamander2 a month ago | parent | prev [-]

[dead]

	▲	gcr 25 days ago | parent [-]
		which models do you have in mind? grok from xai?

▲

embedding-shape a month ago | parent | prev | next [-]

If I could find a RTX Pro 6000 for $5K I'd definitely grab it. I'm running RedHatAI/Qwen3.6-35B-A3B-NVFP4 on one (I had to pay closer to $10K for it though) with 260K context and it's a blast! ds4 by antirez also works well, even IQ2XXS seems to work relatively well, but Qwen3.6-35B-A3B-NVFP4 is both faster and gives higher-quality responses (at least for coding and translations, which I mostly use them for).

▲

tarruda a month ago | parent | prev | next [-]

> What’s the price point for getting into that sweet spot?

In October/2024 I got my Mac studio M1 ultra with 128G, IIRC it was ~$2500. With the recent price explosion, it has certainly gotten more expensive. https://frame.work/ is selling 128G strix halo mainboard for $2700, but you have to add storage and case.

▲

smcleod a month ago | parent | prev | next [-]

Really right now it's the M5 Max MacBook Pro 128GB, the RTX6000 is a nice card but you'd need more than one of them and you have to have a desktop to suit. The DGX Spark is slow and has pretty limited software support.

▲

ttoinou a month ago | parent | prev | next [-]

M5 Max 64GB (sweet spot) or 128GB (only 1000 USD, better to keep it for the future) more are the best quality price ratio, future proof, reliable, resellable and flexible workloads. Harder to use as a server might be the only drawback

▲

throwaw12 a month ago | parent | next [-]

What do you recommend for non-Mac setup? I am a Mac user, but it's getting expensive, and I'm not seeing a reason to jump to the latest M5

▲

barbacoa a month ago | parent | next [-]

Try looking into Ryzen AI Max 395. AMD made a CPU/GPU SoC with unified memory specifically for AI inference. Can buy mini PCs with up to 128gb ram.

▲

krzyk a month ago | parent | next [-]

Isn't CUDA/nvidia the go-to solution for most local models, with the rest being second class citizens?

	▲	gcr a month ago | parent [-]
		Depends. ROCm is pretty well-supported for example. Non-NVIDIA backends tend to get less support and new features land slower, or features that are expected to improve performance wind up hurting it instead. That sort of thing. For basic “token in/token out” workloads without fine tuning, it’s probably fine ??

▲

simple10 a month ago | parent | prev [-]

The Ryzen AI Max 395 128gb is super cool, but not fast for inference. Order of magnitude slower than dedicated GPU but at half the cost. You can run larger models on it but it's slow. Great for local async work. Not great for daily chat or code agent driver.

▲

throwa356262 a month ago | parent [-]

The latest NPUs are pretty fast, I think what is missing is more optimised software support.

	▲	plagiarist a month ago | parent [-]
		The VRAM bandwidth is at least as much of a problem as compute on these ones; there is a lot of data to shuffle around
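
One way to see why bandwidth matters so much is a crude ceiling on token generation speed. This sketch assumes every generated token has to read all active weights from memory exactly once and that nothing else (KV cache reads, compute, overhead) costs anything, so real numbers will be lower. The 220 GB/s value is the real-world Strix Halo figure quoted upthread, the 3B and 17B active-parameter counts match model names mentioned in the thread, and the 4-bit quantization is my assumption. `decode_tps_ceiling` is an illustrative name.

```python
def decode_tps_ceiling(bandwidth_gb_s: float, active_params_billions: float, bits_per_weight: float) -> float:
    """Upper bound on tokens/s if each token streams all active weights once."""
    if min(bandwidth_gb_s, active_params_billions, bits_per_weight) <= 0:
        raise ValueError("all arguments must be positive")
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# A 3B-active MoE at an assumed 4 bits per weight on ~220 GB/s of memory bandwidth.
print(f"A3B @ 4 bpw:     {decode_tps_ceiling(220, 3, 4):.0f} tok/s ceiling")

# A 17B-active MoE at 2.54 bits per weight on the same bandwidth.
print(f"A17B @ 2.54 bpw: {decode_tps_ceiling(220, 17, 2.54):.0f} tok/s ceiling")
```

The ceiling scales linearly with bandwidth and inversely with active bytes per token, which is why small-active-parameter MoE models are the ones that feel usable on unified-memory machines.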

▲

varispeed a month ago | parent | prev [-]

Probably a comparable non-Mac setup will be Threadripper, but it will become much more expensive. My view is that actually Apple products are the cheapest on the market when it comes to performance.

▲

roger_ a month ago | parent | prev [-]

M5 Max 128GB for $1k?

	▲	tempoponet a month ago | parent | next [-]
		The memory upgrade is $1k on a Macbook Pro. The laptop is ~$5500.
	▲	smallerize a month ago | parent | prev [-]
		I think they mean the upgrade to 128GB is +$1k.

▲

tandr a month ago | parent | prev | next [-]

Don't mind me asking, but where did you find $5k RTX 6000? Even 48GB model (previous gen) shows minimum at 7k, and 96GB one (Blackwell) is ~10k on Amazon...

▲

CamperBob2 25 days ago | parent [-]

$5K is presumably what it costs to pay some local gangsters to break into an nVidia warehouse. That's the only way you will pay $5K for an RTX 6000 for the next couple of years.

The server edition has gone up $2K in the last couple of weeks alone, at the outlet where I bought one previously.

	▲	tandr 24 days ago | parent [-]
		Man... Does it mean that buying RTX 6000 for 10k today is actually becoming an investment?

▲

pulse-dev a month ago | parent | prev [-]

[dead]

▲

KronisLV a month ago | parent | prev [-]

There definitely have been some options in the past, cool to see them.

Oddly enough, though, Qwen 3.6 35B A3B and Gemma got some really good reviews, despite being way smaller than any of these ones.

Qwen 3.5, 122B A10B: https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF

Qwen Coder Next, 80B A3B: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF

It's kinda weird that DeepSeek V4 Flash is supposed to be 284B A13B, but shows up as 158B in HuggingFace, probably some weird bug: https://huggingface.co/unsloth/DeepSeek-V4-Flash and that's not even just Unsloth but like the official source too https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash (so also doesn't fit the category unless you get a heavily quantized version to run, but cool regardless)

Mistral Medium 3.5 is interesting because it's 128B but dense, so probably too slow for most folks: https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

GPT-OSS, 120B A5B: https://huggingface.co/unsloth/gpt-oss-120b-GGUF