|
| ▲ | ingenieroariel 4 hours ago | parent | next [-] |
| With Apple devices you get very fast token generation once it gets going, but they are inferior to Nvidia precisely during prefill (processing the prompt/context) before it really gets going. For our code-assistant use cases, local inference on Macs tends to favor workflows with a lot of generation and little reading, which is the opposite of how many of us use Claude Code. Source: I started getting Mac Studios with max RAM as soon as the first Llama model was released. |
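A rough sketch of why the prefill/decode split dominates here. The throughput numbers below are illustrative assumptions, not benchmarks of any particular machine; the point is that a coding-agent request with a huge context spends nearly all its wall-clock time in the compute-bound prefill phase, which is exactly where Apple silicon trails Nvidia.

    # Back-of-envelope latency split for a single LLM request.
    # All throughput figures are assumptions for illustration only.

    def request_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
        """Seconds spent in each phase of one request."""
        prefill_s = prompt_tokens / prefill_tps   # compute-bound: reading the context
        decode_s = output_tokens / decode_tps     # bandwidth-bound: generating tokens
        return prefill_s, decode_s

    # Hypothetical coding-agent request: big context, modest output.
    prefill_s, decode_s = request_latency(
        prompt_tokens=100_000, output_tokens=1_000,
        prefill_tps=150,   # assumed prompt-processing rate on an Apple box
        decode_tps=40,     # assumed generation rate on the same box
    )
    print(f"prefill: {prefill_s / 60:.1f} min, decode: {decode_s / 60:.1f} min")
    # Under these assumptions almost all the time is prefill (~11 min vs ~0.4 min),
    # i.e. the phase where raw compute, not memory bandwidth, is the bottleneck.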
| |
| ▲ | Aurornis 4 hours ago | parent | next [-] | | > With Apple devices you get very fast token generation once it gets going, but they are inferior to Nvidia precisely during prefill (processing the prompt/context) before it really gets going. I have a Mac and an Nvidia build, and I’m not disagreeing. But nobody is building a useful Nvidia LLM box for the price of a $500 Mac Mini. You’re also not getting as much RAM as a Mac Studio unless you’re stacking multiple $8,000 Nvidia RTX 6000s. There is always something faster in LLM hardware. Apple is popular at the price points average consumers can actually afford. | |
| ▲ | storus 4 hours ago | parent | prev | next [-] | | This. It's awful to wait 15 minutes for an M3 Ultra to start generating tokens when your coding agent has 100k+ tokens in its context. This can be partially offset by adding a DGX Spark to accelerate that phase. An M5 Ultra should be like a DGX Spark for prefill and an M3 Ultra for token generation, but who knows when it will show up and for how much? And it will still be at around RTX 3080 GPU levels, just with 512GB of RAM. | |
| ▲ | zozbot234 4 hours ago | parent | prev | next [-] | | All Apple devices have an NPU, which can potentially save power on compute-bound operations like prefill (at least if you're OK with FP16 FMA/INT8 MADD arithmetic). It's just a matter of hooking up support in the main local AI frameworks. This is not a speedup per se, but it gives you more headroom wrt. power and thermals for everything else, so it should yield higher performance overall. | | |
| ▲ | d3k 3 hours ago | parent [-] | | AFAIK, only Core ML can use Apple's NPU (the ANE). PyTorch, MLX, and the other kids on the block use MPS (the GPU). I think the limitations you mentioned relate to that (but I might be missing something). |
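To make the two paths concrete: a minimal sketch, assuming a tiny made-up PyTorch module. PyTorch dispatches to the GPU via the MPS backend, while reaching the ANE means exporting through Core ML and letting its scheduler decide which ops land on the Neural Engine.

    # Two ways a model runs on Apple silicon (sketch; model and shapes are made up).
    import torch
    import coremltools as ct

    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU()).eval()
    example = torch.randn(1, 512)

    # Path 1: PyTorch -> MPS backend, i.e. the GPU. The ANE is never involved.
    if torch.backends.mps.is_available():
        gpu_model = model.to("mps")
        out = gpu_model(example.to("mps"))

    # Path 2: PyTorch -> Core ML, which may place ops on the ANE.
    traced = torch.jit.trace(model, example)
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=tuple(example.shape))],
        compute_units=ct.ComputeUnit.CPU_AND_NE,  # restrict to CPU + Neural Engine
    )
    # Whether ops actually run on the ANE is up to Core ML's partitioner;
    # unsupported ops silently fall back to the CPU.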
| |
| ▲ | FuckButtons 3 hours ago | parent | prev [-] | | Vllm-mlx with prefix caching helps with this. |
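For reference, upstream vLLM exposes prefix caching as an engine flag; the sketch below uses that upstream API and assumes the MLX port mirrors it (the model id and context file are placeholders).

    # Prefix caching: the long shared prefix (system prompt / repo context) is
    # prefilled once and its KV blocks are reused across later requests.
    # This uses upstream vLLM's API; assuming vllm-mlx exposes the same option.
    from vllm import LLM, SamplingParams

    llm = LLM(model="some-coder-model", enable_prefix_caching=True)  # placeholder id
    params = SamplingParams(max_tokens=256)

    shared_context = open("repo_context.txt").read()  # long, identical prefix (placeholder)
    for question in ["Explain module A", "Explain module B"]:
        # Only the suffix after the cached prefix gets re-prefilled on later calls.
        out = llm.generate([shared_context + "\n\n" + question], params)
        print(out[0].outputs[0].text)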
|
|
| ▲ | ac29 3 hours ago | parent | prev | next [-] |
| > a $500 Mac Mini has memory bandwidth that you just can’t get anywhere else for the price. The cheapest new Mac Mini is $600 on Apple's US store. And it has a 128-bit memory interface running LPDDR5X-7500, nothing exotic. The laptop I bought last year for <$500 has roughly the same memory bandwidth, and new machines are even faster. |
| |
| ▲ | jsheard 3 hours ago | parent [-] | | > The cheapest new Mac Mini is $600 on Apple's US store. And you're only getting 16GB at that base spec. It's $1,000 for 32GB, or $2,000 for 64GB plus the requisite SoC upgrade. > And it has a 128-bit memory interface running LPDDR5X-7500, nothing exotic. Yeah, 128-bit is table stakes, and AMD is making 256-bit SoCs now as well. Apple's higher-end Max/Ultra chips are the ones that stand out, with their 512- and 1024-bit interfaces. Those have no direct competition. |
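The bandwidth gap falls straight out of bus width times transfer rate. A quick sketch with the interface widths mentioned in this thread; the specific LPDDR5X speeds paired with the wider buses are assumptions for illustration, so treat the outputs as approximate.

    # Peak memory bandwidth ≈ bus width (bytes) × transfer rate (MT/s).
    # Transfer rates below are assumed/typical pairings, not quotes from the thread.
    def peak_bandwidth_gbps(bus_bits, mtps):
        return (bus_bits / 8) * mtps / 1000  # GB/s

    configs = {
        "base Mac Mini (128-bit, LPDDR5X-7500)": (128, 7500),
        "256-bit AMD SoC (assumed LPDDR5X-8000)": (256, 8000),
        "M4 Max (512-bit, assumed LPDDR5X-8533)": (512, 8533),
        "M3 Ultra (1024-bit, assumed LPDDR5-6400)": (1024, 6400),
    }
    for name, (bits, mtps) in configs.items():
        print(f"{name}: ~{peak_bandwidth_gbps(bits, mtps):.0f} GB/s")
    # ~120, ~256, ~546, ~819 GB/s respectively under these assumptions.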
|
|
| ▲ | zozbot234 5 hours ago | parent | prev | next [-] |
| And then, only Apple devices offer 512GB of unified memory, which matters when you have to fit a larger model (even an MoE) alongside the bigger context/KV cache you need for agentic workflows. You can make do with less, but only by slowing things down a whole lot. |
|
| ▲ | pja 2 hours ago | parent | prev | next [-] |
| Only the M4 Pro Mac Minis have faster RAM than you’ll get in an off-the-shelf Intel/AMD laptop. The M4 Pros start at $1399. You want the M4 Max (or Ultra) in the Mac Studios to get the real stuff. |
|
| ▲ | cmrdporcupine 5 hours ago | parent | prev [-] |
| But a $500 Mac Mini has nowhere near the memory capacity to run such a model. You'd need at least two 512GB machines chained together, maybe one if you quantized the crap out of it. And Apple completely overcharges for memory, so. This is a model you use via a cheap API provider like DeepInfra, or get on their coding plan. It's nice that it will be available as open weights, but it's not practical for mere mortals to run. I can see a large corporation that wants to avoid sending code off-site setting up its own private infra to host it, though. |
| |
| ▲ | zozbot234 5 hours ago | parent [-] | | The needed memory capacity depends on the active parameters (not the same as the total with an MoE model) and on context length, because of KV caching. Even then, the KV cache can be pushed to system RAM, or even farther out to swap, since writes to it are small (just one set of K/V vectors per generated token). |
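A rough sizing sketch of both pieces, assuming the weights are kept fully resident (with an MoE you can page cold experts out, at a speed cost, as the comment above notes). All model dimensions here are hypothetical, picked only to show the shape of the math.

    # Rough memory sizing for a large MoE model; every dimension is hypothetical.
    def weights_gb(total_params_b, bits_per_weight):
        # total parameters (in billions) × bits per weight, converted to GB
        return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_elem=2):
        # 2x for K and V, one vector per token per layer per KV head
        return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

    # Hypothetical ~1T-parameter MoE, 4-bit quantized:
    print(f"weights:  ~{weights_gb(1000, 4):.0f} GB")                 # ~500 GB
    # Hypothetical cache: 60 layers, 8 KV heads of dim 128, 128k context, fp16:
    print(f"KV cache: ~{kv_cache_gb(60, 8, 128, 128_000):.0f} GB")    # ~31 GB
    # So quantized weights dominate capacity, while the KV cache is the part
    # that could plausibly spill to system RAM or swap.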
|