| ▲ | gonzalohm 4 hours ago | |
Did you double the tokens per second by adding a second GPU or was the increase significantly less? | ||
| ▲ | horsawlarway 4 hours ago | parent | next [-] | |
No real change in inference speed. It basically just allows me to slot in more context or a bigger model. A single RTX 3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM. Sometimes that matters, a lot of times it doesn't. On the speed front: MoE models are great. The biggest perf difference in modern models is the move to MoE architectures. I get very similar quality from both the Gemma-4 31B dense model and the Gemma-4 26B MoE model (both at Q4 quant), but the MoE version runs at ~3 times the speed (150 tok/s vs 46 tok/s). | ||
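A rough back-of-the-envelope sketch of why the second card buys context rather than speed. The layer count, KV head count, head dimension, 8-bit KV cache, and ~4.5 bits per weight for Q4 are all assumed placeholder values, not the real Gemma config, and the estimate ignores activations and runtime overhead. Only the 24 GB per RTX 3090 and the tok/s figures come from the numbers above.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int) -> float:
    """KV cache size in GiB: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 2**30


# Hypothetical architecture numbers, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
KV_BYTES = 1            # assumes an 8-bit quantized KV cache
Q4_BITS = 4.5           # assumed average bits per weight for a Q4 quant
VRAM_PER_3090_GIB = 24

weights = weights_gib(31, Q4_BITS)
cache = kv_cache_gib(300_000, LAYERS, KV_HEADS, HEAD_DIM, KV_BYTES)
total = weights + cache

fits_one_gpu = total <= VRAM_PER_3090_GIB
fits_two_gpus = total <= 2 * VRAM_PER_3090_GIB

# Speed ratio from the figures quoted above (MoE vs dense).
moe_speedup = 150 / 46

print(f"weights ~{weights:.1f} GiB, KV cache ~{cache:.1f} GiB, total ~{total:.1f} GiB")
print(f"fits on one 3090: {fits_one_gpu}, fits on two: {fits_two_gpus}")
print(f"MoE speedup: ~{moe_speedup:.2f}x")
```

Under these assumptions the weights alone fit on one card, and it is the long-context KV cache that pushes the total past 24 GB.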
| ▲ | mirekrusin 4 hours ago | parent | prev [-] | |
You’re adding an extra GPU for more VRAM, not speed. | ||