fho | 2 days ago
I am always a bit baffled why Apple gets credited with this. Unified memory has been a thing for decades. I can still load the biggest models on my 10th-gen Intel Core CPU and the integrated GPU can run inference. The difference is that modern integrated GPUs are just that much faster and can run inference at tolerable speeds. (Plus NPUs being a thing now, but that also started much earlier. The 10th-gen Intel Core architecture already had instructions for "AI" workloads... just very preliminary ones.)
mirekrusin | 2 days ago
That's shared memory, not unified: it's partitioned, and the CPU and GPU copies are managed by the driver. Lunar Lake (2024) gets closer, but it's still not as tightly integrated as Apple's design and is capped at 32 GB (Apple goes up to 512 GB). AMD's Ryzen AI Max is closer to Apple, but its memory is still about three times slower.
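To put rough numbers on that bandwidth gap (my own back-of-envelope, not figures from the article): single-stream LLM decoding is largely memory-bandwidth-bound, so the ceiling on tokens per second is roughly bandwidth divided by the bytes streamed per token, which for a dense model is about the size of the weights. The bandwidth values below are approximate published specs, used only for illustration:

```swift
import Foundation

// Back-of-envelope only: decode speed is capped near
//   (memory bandwidth) / (bytes streamed per generated token).
// Bandwidth numbers are approximate published specs, not measurements.
struct Machine {
    let name: String
    let bandwidthGBps: Double
}

let machines = [
    Machine(name: "Apple M3 Ultra",    bandwidthGBps: 800),
    Machine(name: "Apple M4 Max",      bandwidthGBps: 546),
    Machine(name: "AMD Ryzen AI Max+", bandwidthGBps: 256),
    Machine(name: "Intel Lunar Lake",  bandwidthGBps: 136),
]

// A dense model has to stream roughly all of its weights once per token.
let modelSizeGB = 40.0  // e.g. a ~70B-parameter model at ~4-5 bits per weight

for m in machines {
    let upperBound = m.bandwidthGBps / modelSizeGB
    print("\(m.name): ~\(String(format: "%.0f", upperBound)) tok/s upper bound")
}
```

On those assumptions you land at roughly a 3x spread between the top Apple parts and Ryzen AI Max, which is the gap being described.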
eis | 2 days ago
I don't think people are crediting Apple with inventing unified memory - I certainly did not. There have been similar systems for decades. What Apple did is popularize it with widely available hardware: GPUs that don't totally suck for inference, combined with RAM that has decent speed at an affordable price.

Before that, you either had iGPUs, which were slow (and paired with not exactly the fastest DDR memory) but at least sat on the same die, or you had fast dGPUs with their own limited amount of VRAM. So the choice was between direct memory access but not powerful, or powerful but strangled by having to go through the PCIe subsystem to access RAM.

The article is talking about one particular optimization that one can implement with Apple Silicon, and I at least wasn't aware that it is now possible to do so from WebAssembly - so completely dismissing it as if it had nothing to do with Apple Silicon is imho not fair.
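For anyone curious what that "direct memory access" looks like in practice, here is a minimal sketch of my own (not the specific optimization the article describes): on Apple Silicon the CPU and GPU share physical memory, so a Metal buffer can wrap an existing page-aligned CPU allocation with no upload or copy step.

```swift
import Metal
import Foundation

// Illustrative only: the general zero-copy idea that unified memory enables,
// not the article's optimization. CPU and GPU end up touching the same bytes.

guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

let pageSize = Int(getpagesize())
let length = 16 * 1024 * 1024                    // 16 MiB, a multiple of the page size

// Page-aligned host allocation, as required by makeBuffer(bytesNoCopy:).
var raw: UnsafeMutableRawPointer?
precondition(posix_memalign(&raw, pageSize, length) == 0, "allocation failed")
let memory = raw!

// The CPU writes straight into the allocation...
memory.initializeMemory(as: Float.self, repeating: 1.0,
                        count: length / MemoryLayout<Float>.stride)

// ...and the GPU sees the very same bytes through this buffer, with no copy.
guard let buffer = device.makeBuffer(bytesNoCopy: memory,
                                     length: length,
                                     options: .storageModeShared,
                                     deallocator: { ptr, _ in free(ptr) }) else {
    fatalError("makeBuffer(bytesNoCopy:) failed")
}

print("Zero-copy MTLBuffer of \(buffer.length) bytes shared between CPU and GPU")
```

With a discrete GPU, the same data would instead have to be staged over PCIe into VRAM before the GPU could touch it, which is the trade-off described above.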