simpaticoder · 9 hours ago
An M5 Max MBP with 128GB of RAM costs ~$5k. An Nvidia RTX 5090 with 32GB of RAM is $4-5k, and an RTX PRO 6000 with 96GB is ~$10k. Do you have any data on which gives the best price/performance for local inference? Do you know what the big OpenAI/Anthropic/Google datacenters are running?
driese · 7 hours ago
As always: it depends on your needs. Here's a very basic heuristic rundown:

- More RAM: bigger models, more intelligence.
- More FLOPs: faster pre-fill (reading large files and long prompts before answering, i.e. the "time to first token").
- More RAM bandwidth: faster token generation (speed of output).

So basically Macs (high RAM, okay bandwidth, lowish FLOPs) can run pretty intelligent models at an okay output speed, but they will take a long time to reply if you give them a lot of context (like code bases). Consumer GPUs have great generation speed and pre-fill time but low RAM, so you need several if you want to run large, intelligent models. Big-boy GPUs like the RTX 6000 have everything, which is why they are so expensive. A rough back-of-envelope sketch of the bandwidth point follows below. There are more nuances (Metal vs. CUDA, caching, parallelization, etc.), but the points above should hold true in general.
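To make the RAM-bandwidth point concrete, here is a minimal, purely illustrative sketch (Python; the bandwidth and model-size figures are rough assumptions, not benchmarks). When decoding, each generated token has to stream roughly all of the model weights from memory once, so memory bandwidth divided by model size gives a crude upper bound on tokens per second:

    # Crude upper bound: decode tok/s ~= memory bandwidth / model size,
    # since generating one token streams (roughly) all weights once.
    # All hardware/model numbers below are assumptions for illustration.

    def decode_tok_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Rough upper bound on token generation speed for a dense model."""
        return bandwidth_gb_s / model_size_gb

    model_gb = 40  # e.g. a ~70B dense model at 4-bit quantization

    hardware = {
        "M-series Max (unified memory)": 500,  # ~GB/s, assumed
        "RTX 5090 (32GB VRAM)": 1800,          # ~GB/s, assumed; note a 40GB
                                               # model would not even fit in VRAM
        "RTX PRO 6000 (96GB VRAM)": 1800,      # ~GB/s, assumed
    }

    for name, bw in hardware.items():
        print(f"{name}: ~{decode_tok_per_sec(bw, model_gb):.0f} tok/s upper bound")

Real throughput lands well below this bound (kernel efficiency, KV-cache reads, quantization overhead), and pre-fill is compute-bound rather than bandwidth-bound, which is why the Macs fall behind on long prompts despite their large unified memory.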
| ||||||||