Remix.run Logo
redrove 5 hours ago

It’s 128b dense model. Good luck getting more than 3t/s out of a mac. It doesn’t matter if it fits or not.

zozbot234 5 hours ago | parent [-]

You could run it on a single Mac Studio with M3 Ultra, or two Mac Studios with M4 Max at higher perf than that. And lightly quantizing this could give us modern dense models in the ~80GB size range, which is a very compelling target.

freakynit 5 hours ago | parent [-]

Wouldn't matter much still. M3 ultra has 819GB/s unified memory bandwidth. That means theoretical max tokem rate is 819/128 =~ 6.39 t/s. At 80 GB (5 bit quantization), its still near about 10 t/s ... far from a good coding experience. Also, these are theoretical max.. real world token generation rates would be at least 15-20% less.