lambda | 11 hours ago
Generally the bottleneck is RAM throughput. Inference, and token generation in particular, is not all that computationally complex, especially on a single-user instance: you're doing fairly simple calculations per parameter, so the time is dominated by just transferring each parameter from RAM to the cores. A 31B dense model like Gemma 4 has to transfer all 31B parameters per generated token (at 16 bits per parameter for the full model, though on consumer hardware people generally run 4-8 bit quantizations), and that's a lot of memory traffic. Prompt processing and parallel token generation can do a bit more work per transfer, since the same weights can be reused for several calculations in parallel, but even then memory bandwidth is a huge factor.
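To make the arithmetic concrete, here's a minimal back-of-the-envelope sketch of the bandwidth-bound token rate. The bandwidth figures (~80 GB/s for dual-channel DDR5, ~1000 GB/s for a high-end GPU) are assumed for illustration, not measured:

  # Upper bound on tokens/s when generation is memory-bandwidth-bound,
  # assuming every weight is read from RAM once per generated token.
  def tokens_per_second(params_billion: float, bits_per_param: int,
                        bandwidth_gb_s: float) -> float:
      bytes_per_token = params_billion * 1e9 * bits_per_param / 8
      return bandwidth_gb_s * 1e9 / bytes_per_token

  # 31B dense model, fp16 weights, dual-channel DDR5 desktop (~80 GB/s, assumed):
  print(tokens_per_second(31, 16, 80))    # ~1.3 tok/s
  # Same model at 4-bit quantization on a ~1000 GB/s GPU (assumed):
  print(tokens_per_second(31, 4, 1000))   # ~64.5 tok/s

Real throughput lands below these bounds (activations, KV cache, and scheduling all cost something), but it shows why quantization and faster memory help so much more than extra compute for single-user generation.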