So this is probably not good news for the MacBook Ultra with 512GB of RAM rumors being..affordable.

What's worse, this is probably only going to escalate. My angel investment group is getting inundated with pitches that amount to building an RTX 6000 machine with 96GB of RAM and installing a local model to do "thing X".

So even if the OpenAIs of the world stop trying to use up all the RAM, you're going to have thousands of start-ups pushing local models.

thewebguyd 8 hours ago | parent | next [-]

Makes me really wonder about that new Surface Ultra pricing with the Nvidia chip in it.

If Apple can't pull it off with all the supply chain weight they can throw around, what is that thing going to be priced at? Microsoft/Nvidia are either going to be subsidizing it or it's gotta be close to $8,000+ at launch.

drnick1 7 hours ago | parent | prev [-]

> So this is probably not good news for the MacBook Ultra with 512GB of RAM rumors being..affordable.

Why would anyone need that much RAM in a laptop?

bitmasher9 7 hours ago | parent [-]

512GB unified memory is targeting local inference of large models, or local training of non-frontier models.

drnick1 7 hours ago | parent [-]

I doubt you can run a model that requires hundreds of GB of RAM at an acceptable speed (tok/s) on a MacBook.

aroman 5 hours ago | parent [-]

What would be the bottleneck?

bigyabai 2 hours ago | parent [-]

The integrated GPU. Not enough compute onboard to handle prefill for 100GB+ models, and the decode is constrained by memory bandwidth that's lower than most dGPUs at that price.

Apple would be in a much stronger spot right now if they didn't pretend like eGPUs were inconceivable black magic that Macs are incompatible with.

aroman an hour ago | parent [-]

I'm not sure I follow - 614 GB/sec is pretty squarely in dGPU territory (~5070 level). External GPUs can definitely exceed that on the very high end, but it seems pretty competitive, no?

bigyabai 44 minutes ago | parent [-]

Competitive for 16-24GB dGPUs, but for 100GB+ inference workloads it's going to be a decode bottleneck. For smaller models it'd be fine, but the same goes for the smaller GPUs. The fatal bottleneck, though, is the weakness of the iGPU. Filling a KV cache on a 100GB+ model could take a few minutes, or even hours if you're trying to restore a 256k-to-1M token session.
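
For a rough sense of the decode ceiling being argued about here, a back-of-envelope sketch may help. It assumes a dense model whose weights are all read from memory once per generated token, so tokens/sec can be no higher than bandwidth divided by model size. The function name and the model sizes are illustrative assumptions; 614 GB/s is the figure quoted upthread. Real throughput will be lower (KV cache reads, overhead), and mixture-of-experts models that touch only a fraction of their weights per token can do better than this bound suggests.

```python
def decode_tokens_per_sec_upper_bound(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Bandwidth-bound ceiling on decode speed.

    Assumes every generated token requires streaming `weights_read_gb`
    gigabytes of weights from memory exactly once (dense model, batch size 1).
    This is a simplifying assumption, not a measured result.
    """
    if bandwidth_gb_s <= 0 or weights_read_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / weights_read_gb


if __name__ == "__main__":
    bandwidth = 614.0  # GB/s, the figure quoted in the thread
    for size_gb in (16, 24, 100, 400):  # hypothetical model footprints
        ceiling = decode_tokens_per_sec_upper_bound(bandwidth, size_gb)
        print(f"{size_gb:>4} GB model: at most ~{ceiling:.1f} tok/s")
```

Under these assumptions the same bandwidth that looks generous for a 16-24GB model leaves only single-digit tok/s for a dense 100GB+ model, which is the decode side of the argument. It says nothing about prefill, which is compute-bound and not modeled here.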