michaelanckaert 3 hours ago

You need memory on the GPU, not in the system itself (unless you have unified memory such as the M-architecture). So we're talking about cards like the H200 that have 141GB of memory and cost between 25 to 40k.
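To see why a card like the H200 comes up here, a back-of-envelope sketch of the weight footprint (weights only; KV cache and activations add more on top — the function name and numbers are illustrative, not from the thread):

```python
# Rough VRAM needed for model weights alone.
def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

# A 70B model at fp16 (2 bytes/param) needs ~140 GB just for weights,
# which is exactly H200 (141 GB) territory:
print(weight_footprint_gb(70, 2))    # 140.0
# The same model 4-bit quantized (~0.5 bytes/param) drops to ~35 GB:
print(weight_footprint_gb(70, 0.5))  # 35.0
```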

Borealid 3 hours ago | parent [-]

Did you casually glance at how the hardware in the Framework Desktop (Strix Halo) works before commenting?

michaelanckaert 3 hours ago | parent | next [-]

I didn't glance at it, I read it :-) The architecture is a 'unified memory bus', so yes, the GPU has access to that memory.

My comment was a bit unfortunately worded, as it implied I disagreed with yours; sorry for that. I simply wanted to clarify that there's a difference between 'GPU memory' and 'system memory'.

The Framework Desktop is a nice deal. I wouldn't buy the Ryzen AI+ myself; from what I've read it maxes out at about 60 tokens/sec, which is too low for my use cases.

ramon156 3 hours ago | parent | prev [-]

These don't run 200B models at all; results show they can run 13B at best. 70B runs at ~3 t/s according to someone on Reddit.

Borealid 2 hours ago | parent [-]

I don't know where you've got those numbers, but they're wrong.

https://www.reddit.com/r/LocalLLaMA/comments/1n79udw/inferen... seems comparable to the Framework Desktop and reputable - they didn't just quote a number, they showed benchmark output.

I get far more than 3 t/s for a 70B model on normal non-unified RAM, so 3 t/s is implausibly low for a unified memory architecture like Strix Halo.
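A rough sanity check on that number: during decode, every generated token has to stream all the weights through memory once, so memory bandwidth over model size gives a throughput ceiling. Taking ~256 GB/s as a commonly cited figure for Strix Halo's LPDDR5X (an assumption, not a measured value) and ~40 GB for a 4-bit 70B model:

```python
# Decode-speed ceiling: tokens/s <= memory bandwidth / bytes read per token.
def tokens_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

# ~256 GB/s (assumed Strix Halo figure) / ~40 GB (4-bit 70B with overhead):
print(round(tokens_per_sec_ceiling(256, 40), 1))  # 6.4
```

Real throughput lands below this ceiling, but the estimate shows why 3 t/s reads as low rather than typical for this class of hardware.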