	▲	Aurornis an hour ago
		Prefill (prompt processing) is compute-bound because it is dominated by large matrix operations. Token generation (the tokens/s figure) is memory-bandwidth-bound. The RTX 5090 has an incredible amount of compute performance for matrix operations and a lot of memory bandwidth. The Apple Silicon parts have unusually high memory bandwidth for general-purpose compute chips, which is why they can generate tokens so fast. Their raw matrix compute performance is amazing for their power envelope, but not nearly as high as that of a dedicated GPU consuming 400-500W. Apple added tensor cores on the M5 generation to help with those matrix operations, which is why the M5 performs so much better than the M4 Max in that article. Dedicated GPUs like the RTX 5090 are in another league, though. You can see the divergence in the high-resolution gaming benchmarks, too. Once he starts benchmarking at 4K or 6K, where the CPU emulation stops being a bottleneck, the raw compute of the 5090 completely crushes any of the Apple Silicon GPUs.
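
		To put rough numbers on the compute-bound vs. bandwidth-bound split, here is a back-of-the-envelope sketch. It assumes a dense model whose weights are all read once per generated token, and roughly 2 FLOPs per parameter per prompt token during prefill. The function names and the hardware figures in the usage lines are illustrative placeholders, not measured specs for any real chip, and real throughput will land below these bounds.

```python
def decode_tokens_per_s_upper_bound(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling for token generation.

    Assumption: each generated token requires streaming every weight
    from memory once, so throughput <= bandwidth / model size.
    """
    return mem_bandwidth_gb_s / model_size_gb


def prefill_seconds_lower_bound(params_billion: float, prompt_tokens: int, compute_tflops: float) -> float:
    """Compute-bound floor for prompt processing time.

    Assumption: about 2 FLOPs (one multiply, one add) per parameter
    per prompt token, with the hardware running at peak throughput.
    """
    total_flops = 2.0 * params_billion * 1e9 * prompt_tokens
    return total_flops / (compute_tflops * 1e12)


if __name__ == "__main__":
    # Hypothetical device: 1000 GB/s memory bandwidth, 100 TFLOPS matrix compute.
    # Hypothetical model: 10B parameters stored in 10 GB.
    print(decode_tokens_per_s_upper_bound(1000.0, 10.0))
    print(prefill_seconds_lower_bound(10.0, 1000, 100.0))
```

		The point of the sketch is that doubling memory bandwidth raises the decode ceiling without touching prefill time, while doubling matrix compute does the opposite, which is the pattern described above.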