| ▲ | dirk94018 9 hours ago |
| On an M4 Max 128GB we're seeing ~100 tok/s generation on a 30B parameter model in our from-scratch inference engine. Very curious what the "4x faster LLM prompt processing" translates to in practice. Smallish, local 30B-70B inference is genuinely usable territory for real dev workflows, not just demos. Will require staying plugged in though. |
|
| ▲ | fotcorn 8 hours ago | parent | next [-] |
| The memory bandwidth on the M4 Max is 546 GB/s; on the M5 Max it's 614 GB/s, so not a huge jump. The new tensor cores, sorry, "Neural Accelerator", only really help with prompt processing aka prefill, not with token generation. Token generation is memory bound. Hopefully the Ultra version (if it exists) has a bigger jump in memory bandwidth and maximum RAM. |
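A rough sketch of why the bandwidth number dominates generation speed (illustrative numbers only; real throughput is lower because of KV-cache traffic and other overhead):

```python
# Rough ceiling on decode speed if token generation is purely
# memory-bandwidth-bound: every active weight byte must be read
# once per generated token (ignores KV-cache reads and compute).

def max_tok_per_s(bandwidth_gb_s, active_params_b, bytes_per_param):
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dense 30B at ~4-bit quantization (~0.5 bytes/param):
print(max_tok_per_s(546, 30, 0.5))  # M4 Max: ~36 tok/s ceiling
print(max_tok_per_s(614, 30, 0.5))  # M5 Max: ~41 tok/s ceiling
# A MoE model with only ~3B active params per token has a far higher ceiling:
print(max_tok_per_s(546, 3, 0.5))   # ~364 tok/s
```

The 546 → 614 GB/s bump buys ~12% more decode speed at best, which is why the "4x" headline number has to be about prefill. (The ~100 tok/s figure upthread is plausible for a 30B MoE model; a dense 30B would sit near the lower ceiling.)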
| |
| ▲ | anentropic 8 hours ago | parent [-] | | Do any frameworks manage to use the neural engine cores for that? Most stuff ends up running Metal -> GPU I thought | | |
|
|
| ▲ | hu3 9 hours ago | parent | prev | next [-] |
| What about real workloads? Because as context gets larger, these local LLMs approach the useless end of the spectrum with regard to tok/s. |
| |
| ▲ | zozbot234 3 hours ago | parent | next [-] | | The thing about context/KV cache is that you can swap it out efficiently, which you can't with the activations because they're rewritten for every token. It will slow down as context grows (decode is often compute-limited when context is large) but it will run. | |
| ▲ | Someone1234 9 hours ago | parent | prev | next [-] | | I strongly agree. People see local "GPT-4 level" responses and get excited, which I totally get. But how quick is the fall-off as the context size grows? Because if it cannot hold and reference a single source-code file in its context, its usefulness will absolutely crater. That's actually the biggest growth area in LLMs: it is no longer about smarts, it is about context windows (usable ones, not spec-sheet hypotheticals). "Smart enough" is mostly solved; handling larger problems is slowly improving with every major release (but there is no ceiling). | |
| ▲ | satvikpendem 8 hours ago | parent | prev [-] | | That should be covered by the harness rather than the LLM itself, no? Compaction and summarization should allow the LLM to keep running smoothly even on large contexts. | | |
| ▲ | hu3 6 hours ago | parent [-] | | Sometimes it really needs a lot of data to work. |
|
|
|
| ▲ | storus 9 hours ago | parent | prev | next [-] |
| 4x faster is about token prefill, i.e. the time to first token. It should be on par with a DGX Spark there while being slightly faster than the M4 for token generation. In other words, with a long context you don't need to wait 15 minutes, only 4 minutes. |
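The arithmetic behind that tradeoff, with hypothetical throughput numbers: a 4x prefill speedup cuts time-to-first-token proportionally while leaving generation time untouched.

```python
# End-to-end latency = prefill time + decode time.
# Hypothetical numbers: 100k-token prompt, 1k-token reply,
# 400 tok/s prefill before vs 1600 tok/s after a 4x speedup.

def latency_s(prompt_toks, gen_toks, prefill_tps, decode_tps):
    return prompt_toks / prefill_tps + gen_toks / decode_tps

before = latency_s(100_000, 1_000, 400, 100)    # 250 + 10  = 260 s
after  = latency_s(100_000, 1_000, 1_600, 100)  # 62.5 + 10 = 72.5 s
print(before, after)
```

With long prompts the prefill term dominates, so a prefill-only speedup still shrinks total latency dramatically.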
|
| ▲ | fulafel 8 hours ago | parent | prev | next [-] |
| The marketing subterfuge might be about exactly this: technically, "prompt processing" means the prefill phase of inference. So the prompt goes in 4x as fast, but token generation is barely faster. This seems likely, as the memory bandwidth hasn't increased enough for those kinds of speedups, and prefill is more likely to be compute-bound (vs. memory-bandwidth-bound). |
| |
| ▲ | petercooper 4 hours ago | parent [-] | | "So prompt goes in 4x as fast but generates tokens slower." I'd take that tradeoff. On my M3 Ultra, inference is surprisingly fast, but the prompt processing speed makes it painful for anything but a fallback or experimentation, especially with agentic coding tools. |
|
|
| ▲ | eknkc 9 hours ago | parent | prev | next [-] |
| I find time to first token more important than tok/s generally, as these models wait an ungodly amount of time before streaming results. It looks like the claims are true based on M5: https://www.macstories.net/stories/ipad-pro-m5-neural-benchm... so this might work great. |
|
| ▲ | barumrho 8 hours ago | parent | prev | next [-] |
| 100 tok/s sounds pretty good. What do you get with 70B? With 128GB, you need quantization to fit a 70B model, right? Wondering whether local LLM (for coding) is a realistic option; otherwise I wouldn't need to max out the RAM. |
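For what it's worth, the weights-only sizing math (quantization levels are assumptions; KV cache and OS overhead come on top):

```python
# Weights-only memory for a dense model at a given quantization.
# Ignores KV cache, activations, and OS overhead.

def weights_gb(params_b, bits_per_param):
    return params_b * 1e9 * bits_per_param / 8 / 1e9

print(weights_gb(70, 16))  # FP16: 140 GB -- doesn't fit in 128 GB
print(weights_gb(70, 8))   # Q8:    70 GB -- fits with room for context
print(weights_gb(70, 4))   # Q4:    35 GB -- fits comfortably
```

So yes: a 70B dense model only fits at 8-bit or below, and you still want headroom for the KV cache.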
| |
| ▲ | super_mario 7 hours ago | parent [-] | | I run gpt-oss 120b model on ollama (the model is about 65 GB on disk) with 128k context size (the model is super optimized and only uses 4.8 GB of additional RAM for KV cache at this context size) on M4 Max 128 GB RAM Mac Studio and I get 65 tokens/s. | | |
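That small KV-cache footprint checks out with the standard formula, assuming a gpt-oss-120b-style attention layout (36 layers, every other one using a 128-token sliding window, 8 KV heads of dim 64, bf16 cache); treat those architecture numbers as my assumptions:

```python
# Standard KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
# * cached tokens * bytes per value.

def kv_cache_bytes(layers, kv_heads, head_dim, ctx_tokens, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_val

full = kv_cache_bytes(18, 8, 64, 131_072)  # full-attention layers, 128k ctx
swa  = kv_cache_bytes(18, 8, 64, 128)      # sliding-window layers cache only 128 tokens
print((full + swa) / 1e9)                  # ~4.8 GB
```

Grouped-query attention (8 KV heads instead of 64) plus sliding-window layers is why 128k of context costs so little RAM here.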
| ▲ | abhikul0 6 hours ago | parent [-] | | Have you tried the dense (27B, 9B) Qwen3.5 models? Or any diffusion models (Flux Klein, Zimage)? I'm trying to gauge how much of a perf boost I'd get upgrading from an M3 Pro. For reference:
| model                   | size      | params  | backend  | threads | test  | t/s           |
| ----------------------- | --------: | ------: | -------- | ------: | ----: | ------------: |
| qwen35 ?B Q5_K - Medium | 6.12 GiB  | 8.95 B  | MTL,BLAS | 6       | pp512 | 288.90 ± 0.67 |
| qwen35 ?B Q5_K - Medium | 6.12 GiB  | 8.95 B  | MTL,BLAS | 6       | tg128 | 16.58 ± 0.05  |
| gpt-oss 20B MXFP4 MoE   | 11.27 GiB | 20.91 B | MTL,BLAS | 6       | pp512 | 615.94 ± 2.23 |
| gpt-oss 20B MXFP4 MoE   | 11.27 GiB | 20.91 B | MTL,BLAS | 6       | tg128 | 42.85 ± 0.61  |
Klein 4B completes a 1024px generation in 72 seconds.
|
|
|
|
| ▲ | butILoveLife 9 hours ago | parent | prev [-] |
| [flagged] |
| |
| ▲ | dirk94018 9 hours ago | parent | next [-] | | For chat-type interactions the prefill is cached; the prompt is processed at 400 tok/s and generation runs at 100-107 tok/s, so it's quite snappy. Sure, when processing documents of 130,000 tokens it drops to, I think, 60 tok/s, but don't quote me on that. The larger point is that local LLMs are becoming useful, and they are getting smarter too. | |
| ▲ | macintux 9 hours ago | parent | prev | next [-] | | Please read the guidelines and consider moderating your tone. Hostility towards other commenters is strongly discouraged. | |
| ▲ | kamranjon 9 hours ago | parent | prev [-] | | I'm not sure if you're just unaware or purposefully dense. It's absolutely possible to get those numbers for certain models on an M4 Max, averaged over many tokens; I was getting 127 tok/s on a 700-token response from a 24B MoE model just yesterday. I tend to use Qwen 3 Coder Next the most, which is closer to 65 or 70 tok/s, but absolutely usable for dev work. I think the truth is somewhere in the middle: many people don't realize just how performant some of these models have become on Mac hardware (especially with MLX), and just how powerful the shared memory architecture Apple has built is, but there is also a lot of hype and misinformation about performance compared to dedicated GPUs. It's a tradeoff between available memory and performance, but often it makes sense. | | |
| ▲ | fooblaster 8 hours ago | parent [-] | | what inference runtime are you using? You mentioned mlx but I didn't think anyone was using that for local llms | | |
| ▲ | kamranjon 8 hours ago | parent | next [-] | | LM Studio (which prioritizes MLX models if you're on Mac and they are available) - I have it setup with tailscale running as a server on my personal laptop. So when I'm working I can connect to it from my work laptop, from wherever I might be, and it's integrated through the Zed editor using its built in agent - it's pretty seamless. Then whenever I want to use my personal laptop I just unload the model and do other things. It's a really nice setup, definitely happy I got the 128gb mbp because I do a lot of video editing and 3d rendering work as a hobby/for fun and it's sorta dual purpose in that way, I can take advantage of the compute power when I'm not actually on the machine by setting it up as a LLM server. | |
| ▲ | pram 6 hours ago | parent | prev [-] | | LM Studio has had an MLX engine and models since 2024. |
|
|
|