loudmax 4 days ago

There was an interesting post to r/LocalLLaMA yesterday from someone running inference mostly on CPU: https://carteakey.dev/optimizing%20gpt-oss-120b-local%20infe...

One of the observations is how much difference memory speed and bandwidth make, even for CPU inference. Obviously a CPU isn't going to match a GPU for inference speed, but it's an affordable way to run much larger models than you can fit in 24GB or even 48GB of VRAM. If you do run inference on a CPU, you might benefit from some of the same memory optimizations gamers make: favoring low-latency, overclocked RAM.
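As a back-of-the-envelope check, decode speed on a bandwidth-bound setup is roughly memory bandwidth divided by the bytes of weights read per token. A minimal Python sketch (the active-parameter count and quantization width for gpt-oss-120b are assumptions, and the DDR5 numbers are theoretical peaks, not measurements):

    # Rough ceiling on decode speed if every active weight is read once per token.
    # All figures below are assumptions / theoretical peaks, not measurements.

    def tokens_per_sec(bandwidth_gb_s, active_params_billion, bytes_per_param):
        bytes_per_token = active_params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / bytes_per_token

    active_b = 5.1         # gpt-oss-120b is MoE: ~5B active params per token (assumed)
    bytes_per_param = 0.5  # ~4-bit quantization (assumed)

    for label, bw in [("DDR5-4800 dual channel (~77 GB/s)", 77),
                      ("DDR5-6400 dual channel (~102 GB/s)", 102)]:
        print(f"{label}: ~{tokens_per_sec(bw, active_b, bytes_per_param):.0f} tok/s ceiling")

Real-world numbers land well below these ceilings, but the ratio between the two RAM configurations is roughly what the bandwidth argument predicts.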

mistercheph 3 days ago

Outside of prompt processing, the only reason GPUs are better than CPUs for inference is memory bandwidth; the performance of Apple M* devices at inference is a consequence of this, not of their UMA.
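For a sense of scale, here are approximate peak memory bandwidths from vendor specs (rounded, treat as assumptions); if decode is bandwidth-bound, token rates should track these ratios regardless of whether the memory is unified:

    # Approximate peak memory bandwidth in GB/s (rounded vendor figures).
    bandwidths = {
        "Desktop CPU, dual-channel DDR5-6400": 102,
        "Apple M2 Max (unified memory)": 400,
        "Apple M2 Ultra (unified memory)": 800,
        "RTX 4090 (GDDR6X)": 1008,
    }

    baseline = bandwidths["Desktop CPU, dual-channel DDR5-6400"]
    for device, bw in bandwidths.items():
        # Bandwidth-bound decode: relative token rate roughly tracks relative bandwidth.
        print(f"{device}: {bw} GB/s (~{bw / baseline:.1f}x the DDR5 baseline)")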