vid 3 days ago

People are running GPT-OSS-120B at 46 tokens per second on Strix Halo systems, which is quite usable and a fraction of the cost of a 128GB Nvidia or Apple system. Apple's GPUs aren't that strong, so there's room for real competition with Apple and Nvidia here.

827a 3 days ago

Exactly, yeah. My point is that there's a lot more to running these models than raw memory bandwidth and GPU-accessible memory size, and the difference between a $6000 M3 Ultra Mac Studio and a $2000 Ryzen AI Max+ 395 isn't actually as big as the raw numbers would suggest.
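
A quick sanity check on why the raw numbers mislead: if decoding were purely memory-bandwidth bound, throughput would just be bandwidth divided by the bytes of active weights streamed per token. The sketch below uses approximate, publicly quoted figures (bandwidth specs, GPT-OSS-120B's ~5.1B active parameters at ~4-bit MXFP4 weights); treat every number as an assumption rather than a measurement.

```python
# Back-of-envelope decode-speed ceiling, assuming token generation is
# memory-bandwidth bound: each token streams the active weights through
# the memory bus once. All figures are rough public specs, not measurements.

GB = 1e9  # decimal GB, matching marketing bandwidth numbers

systems = {
    # name: advertised memory bandwidth in GB/s (approximate)
    "Strix Halo (Ryzen AI Max+ 395)": 256,
    "M3 Ultra Mac Studio": 819,
}

# GPT-OSS-120B is a mixture-of-experts model, so only the active
# parameters are read per token: ~5.1B active params at ~4-bit (MXFP4)
# weights is roughly 2.6 GB streamed per generated token.
active_params = 5.1e9
bytes_per_param = 0.5  # ~4-bit quantization

bytes_per_token = active_params * bytes_per_param

for name, bw in systems.items():
    ceiling = bw * GB / bytes_per_token
    print(f"{name}: ~{ceiling:.0f} tok/s bandwidth ceiling")
```

The reported ~46 tok/s on Strix Halo sits well below its ~100 tok/s theoretical ceiling, which is the point: compute, kernel quality, and latency eat into the headline numbers, so bandwidth ratios alone overstate the gap between machines.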

On the flip side, though: running GPT-OSS-120B locally is "cool", but have people found useful, productivity-enhancing use cases that justify it over just loading $2000 into an OpenAI API account? That, I'm less sure of.
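
A rough way to frame that question is token volume: how long would the $2000 box need to run flat-out to generate as many tokens as $2000 of API credit buys? The per-token price below is an illustrative placeholder, not any provider's actual rate card.

```python
# Token-volume comparison between $2000 of API credit and a $2000 local
# box. The API price is an assumed placeholder; the local speed is the
# figure reported upthread for GPT-OSS-120B on Strix Halo.

budget_usd = 2000
api_price_per_mtok = 10.0   # assumed $/1M output tokens
local_tok_per_s = 46        # reported local decode speed

api_tokens = budget_usd / api_price_per_mtok * 1e6
print(f"API credit buys ~{api_tokens / 1e6:.0f}M output tokens")

# Time for the local box to generate the same volume, running 24/7:
seconds = api_tokens / local_tok_per_s
print(f"Local box needs ~{seconds / 86400:.0f} days at {local_tok_per_s} tok/s")
```

Under those (very hedged) assumptions, $2000 of credit covers roughly 50 days of continuous local generation, so on token volume alone the local box only wins if it stays busy far longer than that.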

I think we'll get to the point where running a local-first AI stack is obviously an awesome choice; I just don't think the hardware or the models are there yet. Next year's Medusa Halo, combined with another year of open-source model improvements, might be the inflection point.

vid 3 days ago

I use local AI fairly often for innocuous queries (health, history, etc.) that I don't want to feed the spy machines, and I like the hands-on aspect. I'd use it more if I had more time, and while I hear the 120b is pretty good (I mostly use Qwen 30B), I'd use it a lot more if I could run some of the really great models. Hopefully Medusa Halo will be all that.
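
Worth noting for anyone wanting to try this workflow: most local runners (llama.cpp's server, Ollama, LM Studio) expose an OpenAI-compatible endpoint on localhost, so ordinary client code works and queries never leave the machine. A minimal sketch; the port is Ollama's usual default, and the model tag is an assumed example that will differ per setup.

```python
# Minimal "local-first" query: point the standard OpenAI client at a
# local OpenAI-compatible server so nothing is sent to a remote API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # typical Ollama default; adjust for your runner
    api_key="unused",                      # local servers generally ignore the key
)

resp = client.chat.completions.create(
    model="qwen3:30b",  # assumed local model tag, e.g. a Qwen 30B variant
    messages=[{"role": "user", "content": "What are common causes of iron deficiency?"}],
)
print(resp.choices[0].message.content)
```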