RobotToaster 8 hours ago
Isn't the ability to run it more dependent on (V)RAM, with TOPS just dictating the speed at which it runs?
zozbot234 7 hours ago | parent
Strictly speaking, you don't need that much VRAM or even plain old RAM - just enough to store your context and model activations. It's just that as you run with less and less (V)RAM, you'll start to bottleneck on things like SSD transfer bandwidth, and your inference speed slows to a crawl. But even that may or may not be an issue depending on your exact requirements: perhaps you don't need your answer instantly and can wait while it gets computed in the background. Or maybe you're running with the latest PCIe 5 storage, which overall gives you bandwidth comparable to something like DDR3/DDR4 memory.
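A rough back-of-envelope sketch of why bandwidth becomes the limit (not from the comment above; all hardware numbers below are illustrative assumptions): if the weights don't fit in (V)RAM, each generated token has to stream the active weights from wherever they live, so tokens/sec is roughly bandwidth divided by bytes read per token.

```python
# Illustrative sketch only: estimates the token-generation ceiling when
# weight streaming is the bottleneck. All numbers are assumed, not measured.

def tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on tokens/sec if every token must read the full model once."""
    return bandwidth_bytes_per_sec / model_bytes

model_q4 = 70e9 * 0.5   # assumed ~70B-parameter model at ~4-bit quantization (~35 GB)
pcie5_ssd = 12e9        # assumed ~12 GB/s sequential read from a PCIe 5 NVMe drive
ddr4_dual = 50e9        # assumed ~50 GB/s dual-channel DDR4
vram = 1000e9           # assumed ~1 TB/s high-end GPU VRAM

for name, bw in [("PCIe5 SSD", pcie5_ssd), ("DDR4", ddr4_dual), ("VRAM", vram)]:
    print(f"{name:10s} ~{tokens_per_sec(model_q4, bw):6.1f} tok/s ceiling")
```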
NitpickLawyer 7 hours ago | parent
A good rule of thumb is that PP (Prompt Processing) is compute bound while TG (Token Generation) is (V)RAM bandwidth bound.
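A minimal sketch of that rule of thumb, with assumed (hypothetical) hardware figures: prompt processing does on the order of 2*params FLOPs per token and can be batched, so it tends to hit the compute ceiling, while token generation reads the whole quantized model once per token, so it tends to hit the memory-bandwidth ceiling.

```python
# Illustrative estimates only; both hardware numbers are assumptions.
params = 8e9                 # assumed 8B-parameter model
model_bytes = params * 0.5   # ~4-bit quantized weights (~4 GB)
flops = 50e12                # assumed ~50 TFLOPS of usable compute
mem_bw = 300e9               # assumed ~300 GB/s memory bandwidth

pp_ceiling = flops / (2 * params)     # prompt processing: compute-bound estimate
tg_ceiling = mem_bw / model_bytes     # token generation: bandwidth-bound estimate

print(f"PP ceiling ~{pp_ceiling:,.0f} tok/s (compute bound)")
print(f"TG ceiling ~{tg_ceiling:,.0f} tok/s (bandwidth bound)")
```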