randomgermanguy 4 days ago
Depends on how heavy one wants to go with the quants (for Q6-Q4, the AMD Ryzen AI MAX chips seem like a better/cheaper way to get started). The Mac Studio is also a bit hampered by its low compute power, meaning you can't really use a 100B+ dense model, only a MoE, without getting multi-minute prompt-processing times (assuming 500+ token prompts, etc.).
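For a rough sense of why MoE stays feasible where a 100B+ dense model doesn't, here's a back-of-the-envelope sketch. The ~2 FLOPs per active parameter per prompt token rule of thumb and the sustained-throughput figure are assumptions for illustration, not measurements of any particular chip:

    # Back-of-the-envelope prefill (prompt-processing) cost.
    # Assumes ~2 FLOPs per *active* parameter per prompt token; the
    # sustained-throughput figure is purely illustrative, not a benchmark.

    def prefill_seconds(active_params_b: float, prompt_tokens: int, effective_tflops: float) -> float:
        flops = 2.0 * active_params_b * 1e9 * prompt_tokens
        return flops / (effective_tflops * 1e12)

    EFFECTIVE_TFLOPS = 5.0  # assumed sustained rate on quantized weights (illustrative)

    # 500-token prompt: a 120B dense model vs. a MoE with ~5B active parameters.
    print(prefill_seconds(120, 500, EFFECTIVE_TFLOPS))  # ~24 s
    print(prefill_seconds(5, 500, EFFECTIVE_TFLOPS))    # ~1 s
    # Cost scales linearly with prompt length and with active parameter count,
    # which is why MoE models stay usable where similarly sized dense models don't.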
GeekyBear 3 days ago
Given the RAM limitations of the first-gen Ryzen AI MAX, you have no choice but to go heavy on the quantization of the larger LLMs on that hardware.
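Rough footprint math for why: weight size scales linearly with bits per weight, so against a 128 GB unified-memory part (the top first-gen Ryzen AI MAX configuration, as I understand it) a ~120B model only clears the bar once you drop to roughly Q6 and below. A minimal sketch; the bits-per-weight values are typical GGUF figures and the overhead allowance is a guess:

    # Approximate weight footprint at common GGUF quant levels. The 128 GB
    # ceiling and the overhead allowance (OS + KV cache + activations) are
    # assumptions, not specs pulled from any particular machine.

    def weights_gb(params_b: float, bits_per_weight: float) -> float:
        return params_b * 1e9 * bits_per_weight / 8 / 1e9

    TOTAL_RAM_GB = 128   # assumed max first-gen Ryzen AI MAX configuration
    OVERHEAD_GB = 16     # rough allowance for OS, KV cache, activations

    for quant, bits in [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q4_K_M", 4.8)]:
        gb = weights_gb(120, bits)
        verdict = "fits" if gb + OVERHEAD_GB <= TOTAL_RAM_GB else "does not fit"
        print(f"120B at {quant}: ~{gb:.0f} GB of weights -> {verdict}")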
mercutio2 3 days ago
Huh? My maxed-out Mac Studio gets 60-100 tokens per second on 120B models, with latency on the order of 2 seconds. It was expensive, but slow it is not for small queries. Now, if I want to bump the context window to something huge, it does take 10-20 seconds to respond for agent tasks, but it's only 2-3x slower than paid cloud models, in my experience. Still a little annoying, and the models aren't as good, but the gap isn't nearly as big as you imply, at least for me.