vid 3 days ago
People are running GPT-OSS-120B at 46 tokens per second on Strix Halo systems, which is quite usable and a fraction of the cost of a 128GB Nvidia or Apple system. Apple's GPUs aren't that strong either, so this is real competition for both Apple and Nvidia.
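
Rough roofline arithmetic behind that 46 tok/s figure (every number below is an assumption for illustration, not a benchmark): decode on a memory-bound model reads each active weight once per token, so tokens/sec is capped at roughly memory bandwidth divided by bytes of active weights per token.

    # Back-of-envelope decode-speed ceiling for a MoE model on Strix Halo.
    # Assumptions: GPT-OSS-120B activates ~5.1B of its ~117B parameters per
    # token, quantized to ~4 bits (~0.5 bytes/param); Strix Halo (AI Max 395+)
    # has ~256 GB/s of LPDDR5X bandwidth.
    bandwidth_gb_s = 256
    active_params = 5.1e9
    bytes_per_param = 0.5

    bytes_per_token = active_params * bytes_per_param        # ~2.55 GB read per token
    ceiling_tok_s = bandwidth_gb_s * 1e9 / bytes_per_token   # bandwidth-bound ceiling

    print(f"theoretical ceiling: {ceiling_tok_s:.0f} tok/s")           # ~100 tok/s
    print(f"46 tok/s is ~{46 / ceiling_tok_s:.0%} of that ceiling")

The gap between the ~100 tok/s ceiling and the observed 46 tok/s would come from KV-cache reads, compute, and framework overhead, which is part of why raw bandwidth alone doesn't tell the whole story.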
827a 3 days ago | parent
Exactly, yeah. My point is that there's a lot more to running these models than raw memory bandwidth and GPU-addressable memory size, and the difference between a $6000 M3 Ultra Mac Studio and a $2000 AI Max 395+ isn't actually as big as the raw numbers would suggest.

On the flip side, though: running GPT-OSS-120B locally is "cool", but have people found useful, productivity-enhancing use cases that justify it over just loading $2000 into your OpenAI API account? That, I'm less sure of; see the rough arithmetic below.

I think we'll get to the point where running a local-first AI stack is obviously an awesome choice; I just don't think the hardware or the models are there yet. Next year's Medusa Halo, combined with another year of open-source model improvements, might be the inflection point.
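
A hedged break-even sketch for that $2000 comparison. The API price and daily usage are made-up placeholders, since hosted pricing for 120B-class open models varies widely:

    # Local hardware vs. prepaid API credit, under assumed numbers only.
    hardware_cost = 2000.0       # AI Max 395+ box, per the comment above
    api_price_per_mtok = 0.60    # ASSUMED blended $/1M tokens, hosted 120B-class model
    tokens_per_day = 200_000     # ASSUMED personal daily usage

    api_cost_per_day = tokens_per_day / 1e6 * api_price_per_mtok
    breakeven_days = hardware_cost / api_cost_per_day
    print(f"break-even after ~{breakeven_days:,.0f} days "
          f"({breakeven_days / 365:.1f} years)")

Under those assumptions the box takes decades to pay for itself on cost alone, so the case for local has to rest on privacy, latency, or offline use rather than economics.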