zozbot234 6 hours ago

You could run it on a single Mac Studio with M3 Ultra, or two Mac Studios with M4 Max at higher perf than that. And lightly quantizing this could give us modern dense models in the ~80GB size range, which is a very compelling target.

freakynit 6 hours ago | parent [-]

Wouldn't matter much still. The M3 Ultra has 819 GB/s of unified memory bandwidth. A dense model has to read all of its weights for every generated token, so with 128 GB of weights the theoretical max token rate is 819/128 ≈ 6.4 t/s. At 80 GB (5-bit quantization), it's still only about 10 t/s ... far from a good coding experience. Also, these are theoretical maximums; real-world token generation rates would be at least 15-20% lower.
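
For anyone who wants to plug in their own numbers, here is a rough sketch of the back-of-the-envelope estimate above. It assumes decoding is purely memory-bandwidth bound and that every weight is read once per token; the function names and the derating helper are just illustrative, not from any library.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token requires reading all weights once."""
    return bandwidth_gb_s / weights_gb


def derate(rate: float, loss_fraction: float) -> float:
    """Apply an assumed real-world loss (e.g. 0.15 to 0.20) to a theoretical rate."""
    return rate * (1.0 - loss_fraction)


M3_ULTRA_BANDWIDTH_GB_S = 819.0  # figure quoted in the comment above

for weights_gb in (128.0, 80.0):
    ceiling = max_tokens_per_second(M3_ULTRA_BANDWIDTH_GB_S, weights_gb)
    low = derate(ceiling, 0.20)   # assumed 20% loss
    high = derate(ceiling, 0.15)  # assumed 15% loss
    print(f"{weights_gb:.0f} GB: ceiling {ceiling:.2f} t/s, "
          f"realistic {low:.2f}-{high:.2f} t/s")
```

This ignores KV cache reads, prompt processing, and compute limits, so treat the result as a ceiling rather than a prediction.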