| ▲ | ComputerGuru 6 hours ago | |
I would have liked to see a bit more on the theory side of things (optimal weight and inference splits, actual issues with existing drivers, etc.) instead of what’s essentially just a recipe. | ||
| ▲ | atq2119 5 hours ago | |
Agreed. To put this in perspective, batch-size-1 token decode is, in theory, limited by memory bandwidth. The RTX 3090's memory bandwidth is listed as 936 GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the GPU's 24 GB, 30 tok/s means the achieved bandwidth is only 720 GB/s, about 77% of the listed figure. There's a bunch of room for improvement here even without MTP, and those improvements should largely stack with MTP. | ||
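For anyone who wants to redo that arithmetic with their own numbers, here is a rough sketch in Python. It assumes every decoded token reads the full set of weights exactly once and ignores KV cache and activation traffic, so it is a back-of-the-envelope estimate rather than a measurement. The 24 GB model size is the same best-case assumption as above, not a figure from the post, and the function name is made up for illustration.

```python
# Back-of-the-envelope check for bandwidth-bound decode at batch size 1.
# Assumption: each generated token streams all model weights from VRAM once.

def implied_bandwidth_gb_s(weights_gb: float, tokens_per_s: float) -> float:
    """Memory bandwidth implied by a given decode speed (hypothetical helper)."""
    return weights_gb * tokens_per_s

PEAK_GB_S = 936.0    # listed RTX 3090 memory bandwidth
WEIGHTS_GB = 24.0    # assumption: weights fill the whole 24 GB of VRAM
TOKENS_PER_S = 30.0  # decode speed discussed in the thread

achieved = implied_bandwidth_gb_s(WEIGHTS_GB, TOKENS_PER_S)
utilization = achieved / PEAK_GB_S
ceiling_tok_s = PEAK_GB_S / WEIGHTS_GB  # best case if bandwidth were saturated

print(f"achieved bandwidth: {achieved:.0f} GB/s")
print(f"share of listed peak: {utilization:.0%}")
print(f"theoretical ceiling: {ceiling_tok_s:.0f} tok/s")
```

If the model is smaller than 24 GB, the implied bandwidth drops proportionally and the gap to the listed peak gets wider.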
| ▲ | verdverm 6 hours ago | |
I've been using https://spark-arena.com/leaderboard to glean this kind of information for DGX Spark; it's a sort of recipe book. The Nvidia forum has people talking about the things you wish to know. I see some of that on Discord, Reddit, et al., but it's less cohesive. I've switched from using the Spark to run one model as best it can to running several support models for the md kb I'm working on. | ||