I love my MacBook Pro M5 128GB RAM and I love qwen3.6.

BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.

Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.

If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.

Thank me later.

▲

827a a minute ago | parent | next [-]

Apple does not sell a 64GB variant of the M4 Mac Mini. IIRC they never have; its always capped out at 48GB.

If you were planning on getting an M5 128GB; just get a DGX Spark (~$4500) or a 5090-equipped machine (~$4500) plus a Macbook Air (~$1500). You'll come in below the M5 Max 128 pricing (~$6700+ USD) and be happier for it.

▲

astrostl 2 hours ago | parent | prev | next [-]

> MacBook Pro M5 128GB RAM

614 GB/s of memory bandwidth

> MacMini M4 with 64GB of RAM

273 GB/s of memory bandwidth (also only currently available with 48GB)

When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.

And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.

	▲	bigyabai 9 minutes ago \| parent [-]
		> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.

▲

jtbaker 41 minutes ago | parent | prev | next [-]

Nope, have both these machines, can confirm the M5 max blows the M4 mini away. It does get hot, but I use it mostly with an external monitor and keyboard. Conceptually I like the headless model better with a workstation, but work was buying the M5 and can't get it in any other form factor at the monute.

▲

SwellJoe 4 hours ago | parent | prev | next [-]

I opted to buy a normal 32GB laptop for this very reason. I know how loud and hot the GPUs in my desktop run when running even smallish models like Qwen 27B or Gemma 4 31B (which is a better model for most than Qwen 3.6, despite the benchmarks). I also have a Strix Halo which doesn't get loud, because it has a single huge fan, but it does get hot. So, there's no way a laptop could work as hard as models make them work, and not be unbearable. Tiny fans trying to remove all that heat? They gotta be screaming. No reason to spend all that money on a laptop that I couldn't realistically make use of. I do run a lot of VMs on my desktop, but I can get to those on a VPN.

It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy.

All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.

▲

girvo 3 hours ago | parent [-]

Gemma is better than Qwen at everything except coding, in all my evaluations. Which is a shame because that is what I use them for!

▲

UncleOxidant 2 hours ago | parent | next [-]

It would be great if the Gemma folks would release a code-focused model. Probably won't happen, but it's fun to dream.

	▲	SwellJoe an hour ago \| parent [-]
		The Ornith folks say they're doing that, but haven't released the Gemma-based 31b yet (https://github.com/deepreinforce-ai/Ornith-1). But, also, the Qwen-based 35b MoE Ornith version performs worse than Qwen 3.6 and Qwen AgentWorld on my benchmarks (which are focused on finding security bugs, so not exactly the same as agentic coding, but closely related skills). That said, the reason they're able to release Ornith branded post-trains of both Gemma and Qwen is because they're open weights under a friendly license. Someone, not just Google, could make a coding focused Gemma post-train. I don't think it's actually much weaker than Qwen 3.6 for coding; Gemma 4 31b outperforms Qwen 3.6 27b by a wide margin on security bug hunting (at least for the specific bugs in my benchmarks, which are mostly relatively difficult bugs from the Mythos-reported bugs). I'd really love to see a bigger MoE from Google, though. A 70b or 120b MoE would likely be super fun.

▲

ekianjo 41 minutes ago | parent | prev | next [-]

gemma is also worse for tool calling. not just coding

▲

an hour ago | parent | prev [-]

[deleted]

▲

andai 4 hours ago | parent | prev | next [-]

> The reason is simple: your fingers will burn and your head will explode from the noise.

So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)

I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.

There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).

There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)

But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...

▲

iagooar 4 hours ago | parent | next [-]

Just buy a Mac Mini really is good advice if you want to get into real, always-on convenient agentic work.

Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.

▲

marcuskaz 4 hours ago | parent [-]

Except they're not available, 3-4 month wait time.

▲

KiwiJohnno 3 minutes ago | parent | next [-]

I ordered a mac mini m4 pro with 48 gb of ram a couple of weeks ago. Apple said 8-9 weeks.

▲

iagooar 3 hours ago | parent | prev [-]

Buy a refurished or 2nd hand one.

▲

1over137 3 hours ago | parent [-]

Also not really available.

	▲	klardotsh 2 hours ago \| parent [-]
		Especially with anything resembling a usable amount of RAM. Mac Minis and Studios >=64GB are basically permanently sold out everywhere, because everyone, including commercial entities with deeper pockets than most of us plebs, has the exact same idea at the exact same time.

▲

4 hours ago | parent | prev | next [-]

[deleted]

▲

3 hours ago | parent | prev [-]

[deleted]

▲

Arch-TK an hour ago | parent | prev | next [-]

It's okay, completely wrong thread for this statement, but I wouldn't voluntarily use current MacOS (no idea if the older variants weren't terrible) over anything but ssh. Worse than Windows 11.

	▲	braebo 3 minutes ago \| parent [-]
		I could not disagree more.

▲

roadside_picnic 2 hours ago | parent | prev | next [-]

In general if you're setting up a local LLM you should assume it's going to be primarily working as a server and talking to various clients. I use my MBP, but that's because I don't travel much anymore so it can happily work as a server at all times. With the right agent setup you can probably manage most things from your phone even if you don't have a seperate machine to use as a client.

I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp).

Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).

▲

swang 5 hours ago | parent | prev | next [-]

I have an M4 Max and when I was trying out local LLM work with pi it has probably felt like the hottest I've ever felt any kind of Macbook be. I could feel the radiated heat off it even a few inches away. Honestly felt hotter than any Intel Macbook I've used. Because of that I stopped as I didn't want to harm my laptop in case I need to hold it for 10 years due to all the supply issues/price increases.

	▲	dimitrios1 4 hours ago \| parent [-]
		I tried to run it on a M4 Air for shits and giggles. After about 1 minute the entire machine basically bricked and I had to hard reset :D

▲

geophile 4 hours ago | parent | prev | next [-]

That's exactly what I'm doing -- Mini M4 Pro 64GB, qwen3.6.

My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.

▲

somewhatrandom9 3 hours ago | parent | prev | next [-]

Try using DwarfStar 4 and use the --power flag: https://github.com/antirez/ds4#reducing-heat-power-usage-and...

▲

boomskats 2 hours ago | parent [-]

Can you run Qwen 3.6 27B on antirez/ds4 now? I thought it was all about the DeepSeek models.

	▲	somewhatrandom9 2 hours ago \| parent [-]
		No, I don't think Qwen, but I believe he may try and put some version of GLM in it.

▲

acters 5 hours ago | parent | prev | next [-]

Would the new upcoming AMD AI ryzen halo desktop be a better value offer? or dgx spark?

You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.

▲

c7b 4 hours ago | parent | next [-]

My 2c: you don't need the Strix Halo desktop, the chip comes in many rigs, most of them cheaper, the performance difference isn't worth it. It used to be half the price of a DGX Spark or a Mac with 128GB RAM. If you can still find it at that price I'd say it's the best bang for your buck. Otherwise, Macs have 2-3x the memory bandwidth of the DGX Spark, depending on the chip, so I'd prefer them. Unless you're planning on building a cluster. The DGX Spark has two 100GB/s connectors, ideal for clustering. But I haven't checked what else you could get for the price of two DGX Sparks.

▲

girvo 3 hours ago | parent | prev | next [-]

My GB10 Spark-alike is absolutely amazingly fun… but it is not cost effective. Step 3.7 Flash is shockingly capable (IQ4_XS and used for web dev mainly), but it cost me $6800 AUD. They’re even more expensive now. The numbers just don’t make sense: with proper triple head MTP I can get it up to ~40tk/s decode and it runs at around 1000+ tk/s prefill.

$6800 is a lot of API credits for GLM, for example, on any provider you want to use.

Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.

I still am going to buy a second one haha

▲

lee_ars 4 hours ago | parent | prev | next [-]

I'm currently fiddling with a DGX Spark and Qwen3.6-35B-A3B (specifically Qwen3.6-35B-A3B-NVFP4 under vLLM, with EAGLE3 speculative decoding via eagle3-dogacel-vllm), and it's pretty okay in terms of smarts. The speed is relatively usable at about 50 tok/sec with a 256k context window, and it's definitely smart enough to one-shot some basic coding tasks. I had it doing reverse engineering/disassembly of some ancient MS-DOS assembly language games from the 80s and it handled the task well and produced good outputs.

But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.

Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.

	▲	coder543 an hour ago \| parent \| next [-]
		Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated. I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models. As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again. Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.
	▲	cpburns2009 2 hours ago \| parent \| prev \| next [-]
		Looping is a common problem with the Qwen models. I've had good luck using --repeat-penalty=1.1 with llama.cpp and 27B. vLLM should have a similar option.
	▲	anon373839 2 hours ago \| parent \| prev \| next [-]
		I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.
	▲	rnxrx 4 hours ago \| parent \| prev \| next [-]
		There are also nvfp4 quants of Qwen 3.6 27/35 floating around. I've done benchmarks of both and the quality difference vs fp8/bf16 was barely notable. Honestly the nvfp4 capability is the most interesting feature of the Spark (at least for me).
	▲	gnerd00 40 minutes ago \| parent \| prev [-]
		`llama-server` looping mitigations --repeat-penalty something greater than 1.0, set reasoning/thinking OFF explicitly, prefer a gguf with more than 4bit quant

▲

pkroll 4 hours ago | parent | prev [-]

Check the LLM benchmarks once it's out: it's such a common use case for these kinds of machines, you won't be waiting long.

▲

toephu2 3 hours ago | parent | prev | next [-]

I just checked apple's website and configured them:

Mac Studio: Ships: 16–18 weeks

Mac mini: Ships: 10–12 weeks

▲

overgard 4 hours ago | parent | prev | next [-]

I'm running an M5 Max 128GB with Qwen 3.6 and unreal engine in the background and it seems to be ok for me. Quite a power drain if it's not plugged in but I haven't seen any thermal issues.

▲

xd1936 5 hours ago | parent | prev | next [-]

Apple does not currently sell a Mac Mini with 64GB RAM.

▲

iagooar 4 hours ago | parent [-]

Get a 2nd hand one. I was lucky enough to get a new one first, last week I get a 2nd hand one in order to run one of my Hermes minions at work.

▲

stevenaenns 4 hours ago | parent [-]

how many tokens/s generation do you get?

	▲	iagooar 4 hours ago \| parent [-]
		Ballpark 25-30 tok / sec on the Mac Mini Pro M4 + qwen3.6 35B. The generation itself is good, prefill is known to be slow on any Apple M-chip architecture. It is really decent.

▲

bilekas 2 hours ago | parent | prev | next [-]

Can you define "serious programming"? Because I use it to implement things I COULD go and figure out like algorithms or test generation or evaluations etc, the "serious" programming I tend to do myself. That is what I'm paid for.

▲

Arubis 5 hours ago | parent | prev | next [-]

Don't forget that your OLED screen will start to color-shift as the heat cooks the panel!

▲

manmal 5 hours ago | parent [-]

There is no MacBook Pro with OLED (yet).

	▲	Arubis 5 hours ago \| parent [-]
		My mistake on tech; it’s a beautiful display. Alas, I speak from experience when it comes to the thermally-caused color shift. Hopefully it’ll be AppleCare covered.

▲

3 hours ago | parent | prev | next [-]

[deleted]

▲

cosmic_cheese 4 hours ago | parent | prev | next [-]

They really need to release those updated Studios already.

	▲	DennisP 3 hours ago \| parent [-]
		Since they've reduced the max RAM on current Studios from 512GB to 96GB, I'm not holding my breath.

▲

stared 2 hours ago | parent | prev | next [-]

Yes, it gets really hot really fast.

As much as I was tempted to use it on longer projects, I had some reservations about whether it would put too much strain on my MacBook.

▲

Matl 4 hours ago | parent | prev | next [-]

> If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk.

Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.

But you do need a fast LAN connection, otherwise working with agents will be a pain.

	▲	Retr0id 4 hours ago \| parent \| next [-]
		> you do need a fast LAN connection Huh, how come? Low-latency I can understand, but I was under the impression that token throughputs were still barely exceeding dialup bandwidths.
	▲	iagooar 4 hours ago \| parent \| prev [-]
		I disagree LAN connection is the bottleneck. I do even work with it remotely via Tailscale on shaky hotel WIFI and it works fine (or as fine as any other API-based model).

▲

cmgbhm 5 hours ago | parent | prev | next [-]

A local model on my m2 made me come to that conclusion but I definitely was having “that config is $2k more” regret. Thanks for posting this!

▲

c7b 3 hours ago | parent | prev | next [-]

This. Do consider local LLMs, but set aside a dedicated machine for it. Connect via VPN or reverse proxy. If it's not a Mac them I'd also put a server distro on it. No need for a desktop environment, save your RAM.

	▲	tedivm 3 hours ago \| parent [-]
		I have a Linux box with two 3090s and it's been great for running Qwen3.6 27b. I lowered the power on each card down to 250w, and then built a small ducting/fan system to vent the waste heat outside. The machine is pretty much silent, and I'm still getting 110 tokens per second out of it for coding tasks. https://github.com/tedivm/qwen36-27b-docker

▲

oceanplexian 5 hours ago | parent | prev | next [-]

If you want to do coding with a local LLM your best bet is a 6 year old Nvidia 3090 which is substantially more powerful than the highest end overhyped Apple product for 1/5th the price.

▲

chorizo 5 hours ago | parent | next [-]

That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.

▲

nsbk 4 hours ago | parent | next [-]

I beg to differ. Have a look at this repo with single/double 3090 optimized configs for Qwen and Gema models: https://github.com/noonghunna/club-3090

▲

4 hours ago | parent | prev | next [-]

[deleted]

▲

sanderjd 4 hours ago | parent | prev | next [-]

Yeah seems to me like the mac studios with the unified memory architecture are genuinely good bang for the buck at the moment, because of this memory size consideration?

▲

SkitterKherpi 5 hours ago | parent | prev [-]

You can run 8bit 27B models at 24GB, it's definitely enough for the model size.

▲

SwellJoe 4 hours ago | parent | next [-]

The 8-bit quantized 27B Qwen 3.6 is 29GB. You absolutely cannot run that entirely on a 24GB GPU.

You could run a 4-bit, which is 16-17GB. But, you'd need a smallish context or you'd need to quantize your KV cache. Something like TurboQuant or RotorQuant might help.

32GB is the lower bound for comfortably running this size model. I'd maybe even say 64GB is right-sized, because a 256k context is nice to have for agentic workflows, and that won't fit on a 32GB card without heavy quantization (but I haven't tried TurboQuant or RotorQuant to know what impact it has on memory use for context).

You could also put some of the model into system RAM, but that defeats the purpose of your argument that a 3090 will outperform a Mac Mini or Mac Studio. If part of a dense model is in system RAM, it absolutely will not outperform a recent unified memory device.

▲

cpburns2009 3 hours ago | parent [-]

A 32gb card does run it nicely. I use unsloth's UD-Q5_K_XL at 256k context (k/v at q8_0), and get ~67 t/s on a 5090. I still need to look into MTP.

	▲	pbgcp2026 an hour ago \| parent [-]
		[dead]

▲

barbacoa 3 hours ago | parent | prev | next [-]

I'm running qwen 3.6 27b at 8bit quantization and 262k context. It takes 53gb of vram on my system.

▲

bityard 4 hours ago | parent | prev | next [-]

Quantization is a trade-off, though. The quality, while still perhaps good enough for many tasks, is not as good as the full 16-bit weights that the model was designed for/released with.

	▲	pbgcp2026 an hour ago \| parent [-]
		[dead]

▲

jnovek 4 hours ago | parent | prev [-]

I think that’s only true for MoE models. A dense model like 3.6 27b will require more (plus a KV store).

	▲	bityard 4 hours ago \| parent [-]
		No, even MoE models need to fit into (V)RAM. MoE has faster inference because only a subset of layers are used to predict the next token, but the set of layers used changes with every token.

▲

iagooar 4 hours ago | parent | prev | next [-]

My problem is I won't accept anything lower than the 96GB the RTX Pro 6000 Blackwell has. My dream is a workstation with 2x Pro 6000 to run DeepSeek v4 Flash comfortably, possibly qwen 3.6 / ornith on turbo speed.

But man, I have never purchased a computer which is more expensive than a decent family car.

▲

jnovek 5 hours ago | parent | prev | next [-]

An M1 Ultra has 800gbps unified memory. It’s nothing to do with Apple, it’s their microarchitecture. They’re just about the only game in town with high-bandwidth memory if you want >24GB (for less than $10k, anyway).

	▲	murderfs 3 hours ago \| parent [-]
		A 5090 gets you 32GB with 1.8 TB/s of memory bandwidth for ~$4k, RTX A6000 gets you 48GB at 768 GB/s for ~$3.5k, 2x 3090 gets you 48GB for $2000 or so, and if you're willing to go into the wilderness, there are much cheaper options like the AMD MI50.

▲

dheera 3 hours ago | parent | prev [-]

32GB V100

▲

jarjoura 4 hours ago | parent | prev | next [-]

TBF, I just recently picked up this same model, and it's reminding me of the last gen Intel i9 MBP. Just visiting any non-basic website spins up the fans and battery life isn't great either. Yes, this thing is fast, but damn it gets hot just using it for normal tasks.

Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.

	▲	y1n0 2 hours ago \| parent \| next [-]
		Is there something wrong with the m5s? I have an m4 pro and I’ve never heard the fan on it. I don’t do much with local llms, but I naturally use the web and play games (windows games at that with wine/crossover).
	▲	inventor7777 2 hours ago \| parent \| prev \| next [-]
		That seems very unusual for modern Apple Silicon. Our family has: - M3 Pro MacBook Pro 36GB - M2 Pro MacBook Pro 16GB - Mac Studio M4 Max 48GB and I have not heard the fans on any of them with normal use. The only time I've ever heard automatic fans was when I was using a local 12B model on the M3 MacBook Pro, and when running 70B models on the Studio. You should consider checking Activity Monitor and making sure that the usual suspects are not causing issues with sustained high CPU. And you can use an app like [Stats](https://mac-stats.com) if you want to see that info while actively using the computer.
	▲	lowbloodsugar 11 minutes ago \| parent \| prev \| next [-]
		This is not normal. You have a broken Mac. Make an appointment.
	▲	4 hours ago \| parent \| prev [-]
		[deleted]

▲

SkitterKherpi 5 hours ago | parent | prev | next [-]

I am considering getting something like NVIDIA's RTX Spark when it comes out, though even that will be limited to 128GB.

▲

jazzyjackson 5 hours ago | parent | next [-]

They’ll sell you a bundle, either a pair or a quartet so you can have 256 or 512GB over a 400GB/s network link

I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option

▲

c7b 4 hours ago | parent | next [-]

You could fit a Q4 GLM5.2 in 512GB and still have some space for context (372-475GB for the model): https://unsloth.ai/docs/models/glm-5.2

But yeah, there's a bit of a dearth of models that could fully utilize memory in the 128-256GB bracket at the moment. But things move so fast in this space, I wouldn't base my decision on a generation of models that's just a few months old.

	▲	rnxrx 3 hours ago \| parent [-]
		It depends on what's meant by "fully utilized" but fp8 quants of Nemotron 3 Super, the latest Minimax, Cohere A+ and the Mistral small and (especially) medium variants all sit in that 128-256 category, especially with full context or even moderate concurrency. In fact, in a 192GB environment I work with (Hopper GPUs, fwiw) I was pushed into using 4-bit quants with a couple of those to get the model working with a reasonable context window (..but 256 would have rocked out).

▲

girvo 3 hours ago | parent | prev | next [-]

Not Llama 3.1, but Step 3.7 Flash is one of the few new high quality models in this size bracket. DeepSeek v4 Flash too

▲

SkitterKherpi 5 hours ago | parent | prev [-]

10k is rather a lot yes. For LLMs you can use a lot of tokens with 10k with less hassle without the machine (and also it's not like electricity is free), but for some other things like video models 10k would get burned very fast. I am looking for something more in the 5k range though.

▲

awesomeusername 5 hours ago | parent | prev [-]

It's out, I'm daily driving one. It's great

	▲	SkitterKherpi 4 hours ago \| parent \| next [-]
		I assume you have the dgx spark? At this point I am not 100% on the difference other than Linux and Windows. The RTX spark should come around Q4, unless I am mistaken.
	▲	vikingcat 4 hours ago \| parent \| prev [-]
		Are you running a local LLM on it? Did you buy a whole laptop?

▲

codazoda 4 hours ago | parent | prev | next [-]

Today the Mini tops out at 48GB. Gotta go to the Studio to get 64GB.

▲

aurareturn 3 hours ago | parent [-]

Don't buy the Mini or Studio. Both have the M4 which lacks the Neural Accelerators, making prompt processing ~3-4x slower.

▲

mortenjorck 3 hours ago | parent [-]

I assume those don't just work automatically with an off-the-shelf gguf. What do you need in your local inference stack to take advantage of M5's neural accelerators?

	▲	aurareturn 3 hours ago \| parent [-]
		They do work with llama.cpp and MLX automatically.

▲

seanmcdirmid 4 hours ago | parent | prev | next [-]

What sort of M5 are you running? A max? MacMini's don't offer max CPUs.

▲

iagooar 4 hours ago | parent [-]

M5 Max. But I also have a MacMini M4 Pro 64GB. Qwen3.6 runs on the M4 just fine - sure the M5 is at least 2x the speed. If Apple launches a MacMini with an M5, I will be the 1st one to get it.

▲

kristianp 4 hours ago | parent [-]

You're only going to get an incremental improvement with an M5 Pro mini compared to an M4 Pro mini. Memory bandwidth goes from 273GB/s to 307GB/s, about 12.5% improvement for LLMs.

	▲	freehorse 2 hours ago \| parent \| next [-]
		M5's have the neural accelarator that boosts prefill speed a lot. But token generation itself will not change that much, that's true.
	▲	iagooar 4 hours ago \| parent \| prev [-]
		I thought they might ship an M5 Max version, but you are probably right.

▲

samtheprogram 3 hours ago | parent | prev | next [-]

Are you sure you're running it with MLX?

▲

busymom0 5 hours ago | parent | prev | next [-]

Also look into buying the Mac mini refurbished from Apple. They come almost brand new, same warranty and you save money.

▲

Fr0styMatt88 4 hours ago | parent | prev | next [-]

What kind of speed in tk/s do you get with the MacBook?

	▲	iagooar 4 hours ago \| parent [-]
		qwen3.6 27B MLX 8bit -> 15 tok / sec. A bit slow but it is a delightful model to use, and smart too. qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).

▲

singpolyma3 3 hours ago | parent | prev | next [-]

With 128 you can run 122b ;)

▲

gigatexal 2 hours ago | parent | prev | next [-]

Same. And your M5 has acceleration that I don’t with my M3 max. I can’t do anything local it gets hotter than an Intel Mac trying to run docker from back in the day.

▲

verdverm 5 hours ago | parent | prev | next [-]

Get an OEM Spark instead, mine are silent and can fit 2 qwen/gemma at 8bit or give you room for a bunch of other, smaller models (embed,rerank,etc)

▲

dzonga 3 hours ago | parent | prev | next [-]

why not buy one of those "a.i" desktop kits being sold by Nvidia/AMD and just connect to them via network ?

to me that's cheaper than paying an LLM provider such as Anthropic spreading FUD around open weight models & more sustainable too.

	▲	Gigachad 2 hours ago \| parent [-]
		It's still currently way cheaper to pay open router to run qwen for you. And you have the option to use much bigger better models like DeepSeek v4 flash.

▲

ActorNightly 4 hours ago | parent | prev [-]

>If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement

Im sorry, but its time to start calling Apple sycophants out. Stop trying to push your tech jewelry on other people. You only buy those computers because they are Apple, you don't know anything about computing or running LLMs, you don't do any real work, so you should probably not give advice on what to buy.

A single 3090 will run Qwen3.6 27b fine, and its VRAM speed is twice of what the best Mac has. And the build will be cheaper. Decent CPU/Motherboard, 32gb of DDR4 ram, an SSD and a Single 3090 should run max about $4grand. Mac m4 mini is 6grand.

Then, when gpu prices come down (or you find one on a deal), you can upgrade the card, or stick a second one, and benefit from more speed. You can't do that with the trash Apple produces.

Flag me if you want, I don't care. Its embarrasing for the tech community to give advice this bad.

▲

iagooar 3 hours ago | parent [-]

I am not going to flag you, I am much OK with having good arguments.

I just purchased a Mac Mini M4 Pro 64GB for $3k - 2nd hand of course.

I am not a hater of Nvidia and I am planning on building a workstation based on RTX cards. You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).

I am pretty sure I know a thing or two about computing, I have been in the trenches for many, many years and I have had machines of all kinds, shapes and colors. It just so happens that Macs are very capable, very convenient machines that happen to work great in the era of LLMs, too.

But you do you.

	▲	ActorNightly 3 hours ago \| parent [-]
		>You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc). If you are that locked in to Apple, its pretty easy to buy a used Mac Mini older gen for all the non AI stuff. But this is a discussion about inference. Buying a Mac anything for any sort of local inference is a COLOSSAL waste of money.