| ▲ | UncleOxidant 4 hours ago | |
Yeah, you can run it locally if you have enough VRAM, but the reports trickling in say about 3 tok/sec. That was on a Strix Halo box, which definitely has the needed VRAM but isn't going to have memory bandwidth as high as a GPU card's. It's going to be similar on a Mac - that's the dilemma... the unified memory machines have the VRAM, but the bandwidth isn't great for running dense models. A dense model of this size is only going to be usefully runnable by the very few people with multiple GPU cards whose memory adds up to about 70GB. | ||
| ▲ | simjnd 4 hours ago | parent [-] | |
I don't think this is quite correct: a Strix Halo box usually has 256 GB/s of memory bandwidth, an M5 Max has 614 GB/s, and an M3 Ultra (there's no M4 or M5 Ultra) has 820 GB/s. That's still not GDDR or HBM territory, but it's significantly faster, and it's Apple Silicon's edge for AI: when they scale up the chip they add more memory controllers, which adds more channels and more bandwidth. But yeah, in the end it's still only a handful of people who can run it. What I meant is that researching and developing smaller, more powerful models is more interesting than chasing the next 3T-parameter model while burning through VC money and squeezing your customer base ever more aggressively. | ||
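For a rough sense of what those bandwidth figures mean in practice: single-stream decode on a dense model is essentially memory-bandwidth-bound, since every generated token has to stream the full set of weights from memory once. A back-of-the-envelope sketch, assuming the ~70GB model size mentioned above and perfect bandwidth utilization (so these are upper bounds, ignoring compute, KV-cache traffic, and real-world overhead):

    # Rough ceiling on dense-model decode speed:
    # tokens/sec ~= memory bandwidth / model size in bytes.
    # Model size and bandwidths are the figures quoted in this thread.

    MODEL_SIZE_GB = 70  # dense model footprint mentioned above

    bandwidths_gb_s = {
        "Strix Halo": 256,
        "M5 Max (as quoted)": 614,
        "M3 Ultra": 820,
    }

    for name, bw in bandwidths_gb_s.items():
        print(f"{name}: ~{bw / MODEL_SIZE_GB:.1f} tok/s upper bound")

That gives roughly 3.7 tok/s for Strix Halo (consistent with the ~3 tok/s reports), ~8.8 for the quoted M5 Max figure, and ~11.7 for the M3 Ultra - usable, but nowhere near multi-GPU speeds.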