They're honestly not competitive for inference, which is why datacenters largely ignore Apple Silicon. Even the M5 Max is still bottlenecked on dense models by its relatively weak GPU and a paltry ~500-600 GB/s of memory bandwidth. For reference, the RTX 5080 (a consumer GPU) has about 1 TB/s of VRAM bandwidth and runs circles around the M5 Max in GPU compute benchmarks: https://browser.geekbench.com/opencl-benchmarks
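To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch. It assumes decoding a dense model is purely memory-bound, so every generated token has to stream all of the weights once. The 8B-parameter, 8-bit model is a hypothetical example picked so it fits on both devices, the function name is made up for illustration, and real throughput will land below these ceilings.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s: float,
                                     params_billion: float,
                                     bytes_per_param: float) -> float:
    """Rough upper bound on single-stream decode speed for a dense model.

    Assumes each generated token reads every weight exactly once and that
    memory bandwidth is the only limit (no compute, cache, or overhead effects).
    """
    model_size_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures are the approximate ones quoted in the comment above.
devices = [
    ("M5 Max (low estimate)", 500.0),
    ("M5 Max (high estimate)", 600.0),
    ("RTX 5080", 1000.0),
]

# Hypothetical dense model: 8B parameters at 1 byte per parameter (8 GB).
for name, bandwidth in devices:
    ceiling = decode_tokens_per_second_ceiling(bandwidth, 8.0, 1.0)
    print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

The ratio between the ceilings is just the ratio of the bandwidths, so under this assumption the 5080 comes out roughly 1.7-2x ahead on any dense model that fits in its VRAM.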

Even for home inference, it's hard to recommend a dedicated Mac over a cheap Nvidia server box.

> They are probably the only ones that have the talent, resources, and capital to do that.

Apple invented OpenCL. The problem was their reluctance to work with the rest of the industry, and once CUDA took over it was too late for them to even try.

seanmcdirmid 9 days ago

> For reference, the RTX 5080 (a consumer GPU) has about 1 TB/s of VRAM bandwidth and runs circles around the M5 Max in GPU compute benchmarks: https://browser.geekbench.com/opencl-benchmarks

NVIDIA hampers its GPUs with non-unified graphics memory, while the M series can use nearly everything the computer has (well, you need to save 4GB or so). It also works on airplanes and in hotel rooms. And a cheap NVIDIA server box with 64GB of RAM (what my M3 Max laptop has)... how cheap is that?

	andriy_koval 9 days ago
		I think the non-unified memory issue is solved by the software layer in a datacenter setting: the model is distributed across multiple GPUs in the same server, or across multiple servers if the model is extra large.
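As a rough illustration of that idea, the sketch below greedily assigns contiguous blocks of layers to GPUs based on how much memory each one has. It is a toy model under loud assumptions: every layer is the same size, activations and KV cache are ignored, and `shard_layers` is a hypothetical helper, not any framework's API.

```python
def shard_layers(num_layers: int,
                 layer_gb: float,
                 gpu_capacity_gb: list[float]) -> list[range]:
    """Assign contiguous layer ranges to GPUs in order, filling each one.

    Assumes uniform layer size and ignores activations and KV cache.
    Raises ValueError if the combined capacity cannot hold every layer.
    """
    assignments = []
    next_layer = 0
    for capacity in gpu_capacity_gb:
        # How many whole layers fit on this GPU, capped by what is left.
        fits = min(int(capacity // layer_gb), num_layers - next_layer)
        assignments.append(range(next_layer, next_layer + fits))
        next_layer += fits
    if next_layer < num_layers:
        raise ValueError("model does not fit on the given GPUs")
    return assignments


# Hypothetical 80-layer model at 0.5 GB per layer (40 GB) on three 16 GB GPUs.
for gpu_index, layers in enumerate(shard_layers(80, 0.5, [16.0, 16.0, 16.0])):
    print(f"GPU {gpu_index}: layers {layers.start}-{layers.stop - 1}")
```

In this example no single 16 GB card can hold the 40 GB model, but three of them can, which is the sense in which software works around the lack of unified memory.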