isoprophlex 3 hours ago

Extremely impressive, but can one really run these >200B param models on-prem in any cost-effective way? Even if you get your hands on cards with 80 GB of RAM, you still need to tie them together in a low-latency, high-bandwidth manner.

It seems to me that small/medium-sized players would still need a third party to get inference going on these frontier-quality models, and we're not in a fully self-owned, self-hosted place yet. I'd love to be proven wrong though.
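
For scale, a back-of-envelope sketch in Python (weights only; KV cache and activation memory come on top, and the 200B figure is just illustrative):

    # Weight memory for a ~200B-parameter dense model at common precisions.
    PARAMS = 200e9

    for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        gb = PARAMS * bytes_per_param / 1e9
        print(f"{name}: ~{gb:.0f} GB of weights")

    # fp16: ~400 GB, int8: ~200 GB, int4: ~100 GB -- even aggressively
    # quantized, the weights alone overflow a single 80 GB card.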

Borealid 2 hours ago | parent | next

A Framework Desktop exposes 96 GB of RAM for inference and costs a few thousand USD.

michaelanckaert an hour ago | parent

You need memory on the GPU, not in the system itself (unless you have unified memory, as in Apple's M-series architecture). So we're talking about cards like the H200, which has 141 GB of memory and costs between $25k and $40k.
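
A rough sizing sketch (my own assumptions: fp16 weights, no KV cache or runtime overhead counted):

    import math

    # How many cards are needed just to hold the weights of a 200B model?
    model_gb = 200e9 * 2 / 1e9  # fp16 weights: ~400 GB

    for card, vram_gb in [("H200 (141 GB)", 141), ("80 GB card", 80)]:
        n = math.ceil(model_gb / vram_gb)
        print(f"{card}: at least {n} cards for the weights alone")

    # H200: 3 cards, 80 GB card: 5 cards -- hence the low-latency,
    # high-bandwidth interconnect the top comment mentions.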

Borealid an hour ago | parent

Did you casually glance at how the hardware in the Framework Desktop (Strix Halo) works before commenting?

michaelanckaert an hour ago | parent | next

I didn't glance at it, I read it :-) The architecture is a 'unified memory bus', so yes, the GPU has access to that memory.

My comment was a bit unfortunate, as it implied I didn't agree with yours; sorry for that. I simply wanted to clarify that there's a difference between 'GPU memory' and 'system memory'.

The Framework Desktop is a nice deal. I wouldn't buy the Ryzen AI+ myself; from what I read it maxes out at about 60 tokens/sec, which is too low for my use cases.

ramon156 39 minutes ago | parent | prev

These don't run 200B models at all; results show they can run 13B at best. 70B runs at ~3 tok/s according to someone on Reddit.
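
Those figures are in the ballpark of a memory-bandwidth-bound estimate. A crude sketch, assuming ~256 GB/s for Strix Halo (my assumption, not a verified spec) and that single-stream decoding reads every weight once per token:

    # Upper bound on decode speed: tok/s <= bandwidth / weight size,
    # since each generated token streams all weights from memory once.
    mem_bw_gbs = 256  # assumed Strix Halo memory bandwidth, GB/s

    for name, weight_gb in [("13B @ 4-bit", 6.5), ("70B @ 4-bit", 35)]:
        print(f"{name}: ~{mem_bw_gbs / weight_gb:.0f} tok/s upper bound")

    # 13B: ~39 tok/s, 70B: ~7 tok/s; measured numbers land below the
    # bound, consistent with the ~3 tok/s reported for 70B.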

buyucu 36 minutes ago | parent | prev

I'm running them on a GMKTec Evo 2.