bitmasher9 7 hours ago

512GB unified memory is targeting local inference of large models, or local training of non-frontier models.

drnick1 7 hours ago | parent [-]

I doubt you can run a model that requires hundreds of GB of RAM at an acceptable speed (tok/s) on a MacBook.

aroman 5 hours ago | parent [-]

What would be the bottleneck?

bigyabai 2 hours ago | parent [-]

The integrated GPU. There's not enough compute onboard to handle prefill for 100GB+ models, and decode is constrained by memory bandwidth that's lower than most dGPUs at that price.

Apple would be in a much stronger spot right now if they didn't pretend like eGPUs were inconceivable black magic that Macs are incompatible with.

aroman an hour ago | parent [-]

I'm not sure I follow - 614 GB/sec is pretty squarely in dGPU territory (~5070 level). External GPUs can definitely exceed that on the very high end, but it seems pretty competitive, no?

bigyabai an hour ago | parent [-]

Competitive for 16-24GB dGPUs, but for 100GB+ inference workloads it's going to be a decode bottleneck. For smaller models it'd be fine, but the same goes for the smaller GPUs.
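
To put rough numbers on the decode point, here is a back-of-envelope sketch. It assumes a dense model where every weight is read from memory once per generated token, so bandwidth divided by weight size gives a ceiling on tok/s. The function name and the 24/100/400 GB sizes are illustrative assumptions; only the 614 GB/sec figure comes from this thread. KV cache reads and MoE sparsity are ignored, so real numbers will differ.

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed for a dense model.

    Assumption: each generated token streams all weights from memory once,
    and nothing else (KV cache, activations) competes for bandwidth.
    """
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    bandwidth = 614  # GB/sec, the figure quoted in the thread
    for weights_gb in (24, 100, 400):  # hypothetical model sizes
        ceiling = decode_tokens_per_sec(bandwidth, weights_gb)
        print(f"{weights_gb} GB of weights: at most ~{ceiling:.1f} tok/s")
```

Under these assumptions the same bandwidth that looks fine for a 24 GB model gives only single-digit tok/s once the weights pass 100 GB, which is the gap being described.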

In particular though, the fatal bottleneck is the weakness of the iGPU. Filling a KV cache on a 100GB+ model could take a few minutes, or even hours if you're trying to restore a 256k-to-1M-token session.
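
A rough way to sanity-check the prefill estimate is sketched below. It uses the common approximation of about 2 FLOPs per parameter per prompt token for a dense transformer and ignores the attention cost that grows with context length, so it is a lower bound. The 100B parameter count and the 20 TFLOPS sustained throughput are placeholder assumptions, not measured figures for any Mac.

```python
def prefill_seconds(params: float, prompt_tokens: int, sustained_flops: float) -> float:
    """Lower-bound prefill time for a dense transformer.

    Assumption: ~2 FLOPs per parameter per token; the attention term,
    which grows with context length, is ignored.
    """
    return 2 * params * prompt_tokens / sustained_flops


if __name__ == "__main__":
    params = 100e9          # hypothetical 100B-parameter dense model
    sustained = 20e12       # hypothetical 20 TFLOPS sustained, not a measured value
    for tokens in (8_000, 256_000, 1_000_000):
        minutes = prefill_seconds(params, tokens, sustained) / 60
        print(f"{tokens:>9} prompt tokens: at least ~{minutes:.1f} min")
```

With these placeholder numbers a short prompt takes about a minute, while 256k to 1M tokens lands in the tens of minutes to hours range, and a lower sustained throughput scales those times up proportionally.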