Remix.run Logo
nubg a day ago

how many tokens per second do you get?

usagisushi 16 hours ago | parent | next [-]

Not the OP, but their setup must be faster than my 4060 16GB + 3060 12GB setup. Here are my numbers (typical values, N=1):

    Model                         pp (t/s)    tg (t/s)
    Qwen 3.6 27B            900           29
    Qwen 3.6 35B-A3B   2100          85
    Gemma 4 31B            750           28
    Gemma 4 26B-A4B   2500         90
- All models: UD-Q4 w/ MTP. Context size: ~100k (MoE) / ~70k (Dense).

- Layer splitting used. Tensor splitting is ~1.2x faster in TG, but power spikes from 150W to 380W.

cybertim a day ago | parent | prev [-]

I bought two RTX3080s with 20GB during my holiday in china (set me back 700euros) I'm getting 800-1000 input tps and 60-100tps output with Qwen 3.6 27b Q8 (MTP, P2P, 200k context) this feels like opus4.5 level while coding (pi harness). Also easy to just host your own openai compatible api from home this way and still use your MacBook as dev station.