tarruda 21 days ago

I have a 128G mac studio and even 397B was a happy surprise to me due to its high quantization resilience.

I've created a 2.54BPW quant that fits on my hardware with 128k context, 20 tps token generation and 200 tps prompt processing, while maintaining high scores on many benchmarks: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/discus...

chrisweekly 21 days ago | parent | next [-]

Apple store's current options for mac studio seem to max out at 96GB. I'm questioning ROI, esp. given it's not upgradeable. Curious about others' takes on new mac hardware.

tarruda 21 days ago | parent | next [-]

> I'm questioning ROI

If by ROI you mean saving more money than using paid APIs, then I don't think it is worth it. All you gain is full sovereignty over your AI usage.

hadlock 20 days ago | parent | prev | next [-]

Rumor mill has been buzzing about m5 mini and studio. If anything materializes close to what the rumor mill has been suggesting, the m5 could be appealing to home lab/local LLM folks, or at least help inform if the M6 will be worthwhile. Assuming Apple was able to lock in halfway reasonable memory prices early enough in advance.

drob518 21 days ago | parent | prev | next [-]

Currently, Apple is letting some of its models go out of stock in preparation for new models coming in a few weeks. I would expect at least 128 GB models at that time. That said, the memory crunch is hitting everyone.

the_lucifer 20 days ago | parent [-]

Yep, even with their supply chain prowess, they're being hit now that some of their longer-term memory contracts are nearing renewal.

drob518 20 days ago | parent [-]

Yep. Something needs to break soon. Or rather, something WILL break soon, one way or another. Was talking to a friend last night who works planning infrastructure rollout and he said costs for equipment have roughly doubled in the last six months. Soon, these projects aren’t going to be viable.

ramses0 20 days ago | parent | prev [-]

I'd held off from buying a new personal laptop for quite a few years and felt that the M5-128gb was justifiable once I started really seeing payoffs from using AI at work.

Running w/ Cursor and doing some "nights and weekends" type coding / conversations, I was hitting $100-200 of usage within a few weeks. I know there's probably better ways to manage costs, but I was getting enough value out of it to keep bumping my spend limit from $20 => $40 => $80 => $120 (and then I stopped spending! :-)

Messing around with local-llm, I've settled on `omlx` and `gemma` for "conversational", and I think it's `qwen-120b-a3b-6bit` or something for the "heavy hitter". Gemma "gets it" a lot more, whereas that particular `qwen` tends to fall into the "MuSt WrItE CoOooDeee!" behaviour in a lot of cases instead of holding a conversation, and does an awesome job of randomly spitting out ascii-art diagrams or including full-blown bash shell scripts to illustrate different cases.

My POV is: "Local for slightly slower/casual usage", the ~1% of battery usage per minute of LLM is shockingly accurate (eg: 30 minutes == 30% drop!). "Gemma for discussion and emitting DESIGN-... docs", and "Qwen for converting DESIGN-... to PLAN-...", (as well as implementation, but generally from a fresh context loading the relevant PLAN-... or supporting docs)

...then supplement that with direct Cursor usage in case I screw up some setting and can't get the local LLM working, or if I need to include literal web research or really need access to some SOTA model. Using the pi-coder harness locally, web pages are kind of a difficult conundrum as they can be kind of gigantic and are really worthy of special casing, some sort of sub-harness, etc... but the more "stuff" you put into the agent, the less context window (and memory!) you have available, so it's a real balancing act.

The other biggest problem is that you're limited (locally) to ~20-80tps and in some cases you have to chew on or "swallow" the whole prompt up to that point if you end up with some sort of cache miss (TTFT). The `omlx` server does a pretty good job (after you tweak some settings and stuff) of allowing MANY prompt continuations to nearly immediately start generating tokens, but sometimes if I have two agents going (eg: Gemma talking shit about Qwen's output or vice versa) in a longer context window, then you'll take that hit.

"Other people's compute" is definitely more freeing, but even looking at $200/mo usage that's $2400 vs. the ~$6k for a maxed out MBP. Call it $2500 vs. $7500 and you'd say that "local AI gives you a 3-year amortization window for a slower, worse experience" ... but if you're strategic about your usage, the ability to "talk for free" and occasionally "burst" to an online provider or having some hugging-face tokens to try out different models that you can't quite run locally is really nice. Talking to the AI (locally) to even just do non-coding planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!
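For anyone who wants to plug in their own numbers, the amortization math above can be sketched roughly like this. It is a minimal sketch, assuming a flat annual API spend; `breakeven_years` is a hypothetical helper name, and it ignores electricity, resale value, and the fact that the laptop is useful for other things.

```python
def breakeven_years(hardware_cost: float, annual_api_spend: float) -> float:
    """Years of API spend needed to equal a one-time hardware purchase.

    Simplifying assumption: spend is flat per year, with no electricity,
    resale value, or price changes taken into account.
    """
    if annual_api_spend <= 0:
        raise ValueError("annual_api_spend must be positive")
    return hardware_cost / annual_api_spend


# Figures quoted in the comment: $200/mo of usage vs. a ~$6k maxed-out MBP.
exact = breakeven_years(6000, 200 * 12)

# The rounded "call it $2500 vs. $7500" version from the same comment.
rounded = breakeven_years(7500, 2500)

print(f"exact figures: {exact:.1f} years, rounded figures: {rounded:.1f} years")
```

With the unrounded figures the window is a bit shorter than the rounded 3 years, but it lands in the same ballpark.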

In some ways, seeing the "advantage" of having the local 128gb capacity for LLM, I'm semi-wishing I'd have gotten a mac mini instead, but then I can't quite do the 100% offline stuff (eg: coffee-shop) that the maxed out laptop allows.

If it were a mini running locally, I'd feel more comfortable calling it the always-on "AI brain" to process my emails, run crontab summaries, whatever kind of "open-claw-ish" stuff that you could do w/o relying on having to "keep the laptop lid open all the time". I'm sure there's ways to repurpose things, but longer-term, call it even 3-5 years from now... any sort of 128gb machine will be more than capable where you'd want to have one "doing stuff" locally within your home network (IMHO).

chrisweekly 20 days ago | parent [-]

Thank you! That was a generous and helpful response, I really appreciate it. Food for thought...

>"...if you're strategic about your usage, the ability to "talk for free" and occasionally "burst" to an online provider or having some hugging-face tokens to try out different models that you can't quite run locally is really nice. Talking to the AI (locally) to even just do non-coding planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!"

^ this resonates, loudly.

ramses0 20 days ago | parent [-]

Thanks, kind stranger! I wrote the comment that I would have loved to find before (and after) making the leap. Stuff is changing so fast, and there's at least three tracks: "Scavenger-old-linux-box", "Fancy-AI-cube", "Mac + $$$ + RAM"

Again: I'm finding waaaay enough utility that I'm tempted to invest more "CapEx" and get a used system for day-to-day, "always on" local work... but more literally, that's probably a better job for "OpEx"! Tune my "crontab" work against local models and then max out at a $1/day budget slaved to an always on RPI connected to ethernet at home.

$365/year of off-site AI lasts 10 years before I come close to recouping the hardware (and electricity) costs of having "yet another device" purchased and turned on 24x7... and certainly there will come a day when you go to the store and buy a $200-500 "TITO" device (Tokens In => Tokens Out) that plugs into a ~30-60W USB-C port before then.

If you're using HF tokens (or "rent-a-A100" or whatever), are always connected to home ethernet (Sun Microsystems: The Network IS the Computer), and maybe supplement with a Kagi backend for attaching to the raw internet then you get _most_ of the surety of "my queries are private" unless you're locally hacked or are the target of nation-state scrutiny. :shrug:?

Keep in touch if you end up doing something cool with all this! $USERNAME@yahoo.com (and hopefully I'll have my AI setup filtering out all the viagra spam before then!).

smcleod 20 days ago | parent | prev | next [-]

That's impressive getting a 397B down to <110GB~. HF link is broken though!

tarruda 20 days ago | parent [-]

> That's impressive getting a 397B down to <110GB

It is higher than 110GB. macOS allows up to 125G of the RAM to be shared with the GPU, so it is certainly less than that!
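As a rough back-of-envelope check on that range, the weight size can be estimated from the parameter count and bits per weight. This is only a sketch: it assumes exactly 397B parameters and treats the 2.54 BPW figure mentioned upthread as a uniform average. It ignores file metadata and KV cache, and the thread does not say whether the GB figures are decimal or binary, so both are printed. `quantized_size_bytes` is a hypothetical helper name.

```python
def quantized_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in bytes: parameters * bits per weight / 8."""
    return n_params * bits_per_weight / 8


# Assumed inputs: 397e9 parameters (from the model name) and 2.54 bits per weight.
size = quantized_size_bytes(397e9, 2.54)

size_gb = size / 1e9       # decimal gigabytes
size_gib = size / 2**30    # binary gibibytes

print(f"{size_gb:.1f} GB (decimal), {size_gib:.1f} GiB (binary)")
```

Under these assumptions the estimate is about 126 GB decimal or about 117 GiB binary, which sits between the two bounds if they are read as binary units.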

> HF link is broken though!

Doesn't seem broken to me, but you should be able to search for tarruda/Qwen3.5-397B-A17B-GGUF on huggingface.

ttoinou 21 days ago | parent | prev [-]

better than antirez ds4 ?

tarruda 21 days ago | parent [-]

I only tried a very early version of that when it was just a llama.cpp fork and Qwen was certainly better in my tests.

But I was not super impressed with deepseek 4 flash using it from the official API either, so it doesn't seem to be the quantization's fault. It is a good model, but nothing out of the ordinary in the few benchmarks I ran on it (with full awareness that benchmarks are biased).