llama.cpp (b8642) auto-fits ~200k context on this 24GB RX 7900 XTX, and token generation ("S_TG t/s") holds a solid 100+ tok/s through the first 32k of it, nice!
./llama-batched-bench -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
-npp 1000,2000,4000,8000,16000,32000,64000,96000,128000 -ntg 128 -npl 1 -c 0
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 0.416 | 2404.87 | 1.064 | 120.29 | 1.480 | 762.20 |
| 2000 | 128 | 1 | 2128 | 0.755 | 2649.86 | 1.075 | 119.04 | 1.830 | 1162.83 |
| 4000 | 128 | 1 | 4128 | 1.501 | 2665.72 | 1.093 | 117.08 | 2.594 | 1591.49 |
| 8000 | 128 | 1 | 8128 | 3.142 | 2545.85 | 1.114 | 114.87 | 4.257 | 1909.47 |
| 16000 | 128 | 1 | 16128 | 6.908 | 2316.00 | 1.189 | 107.65 | 8.097 | 1991.73 |
| 32000 | 128 | 1 | 32128 | 16.382 | 1953.31 | 1.278 | 100.12 | 17.661 | 1819.16 |
| 64000 | 128 | 1 | 64128 | 43.427 | 1473.74 | 1.453 | 88.12 | 44.879 | 1428.89 |
| 96000 | 128 | 1 | 96128 | 82.227 | 1167.50 | 1.623 | 78.86 | 83.850 | 1146.42 |
|128000 | 128 | 1 | 128128 | 133.237 | 960.69 | 1.797 | 71.25 | 135.034 | 948.86 |
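
For anyone who wants to sanity-check the columns, here is a rough Python sketch. It assumes that with a single sequence (B = 1) each speed column is just tokens divided by seconds, which matches the printed numbers to within rounding. The `ROWS` list and the `summarize` helper are made up for illustration and are not part of llama.cpp.

```python
# Values copied from the table above: (PP, TG, T_PP s, T_TG s, reported S_TG t/s)
ROWS = [
    (1000, 128, 0.416, 1.064, 120.29),
    (2000, 128, 0.755, 1.075, 119.04),
    (4000, 128, 1.501, 1.093, 117.08),
    (8000, 128, 3.142, 1.114, 114.87),
    (16000, 128, 6.908, 1.189, 107.65),
    (32000, 128, 16.382, 1.278, 100.12),
    (64000, 128, 43.427, 1.453, 88.12),
    (96000, 128, 82.227, 1.623, 78.86),
    (128000, 128, 133.237, 1.797, 71.25),
]


def summarize(rows):
    """Recompute speeds from the timings and compare TG speed to the first row.

    Assumption: speed = tokens / seconds when only one sequence is in flight.
    """
    baseline_s_tg = rows[0][4]  # reported S_TG at the shortest prompt
    summary = []
    for pp, tg, t_pp, t_tg, s_tg in rows:
        summary.append({
            "pp": pp,
            "s_pp": pp / t_pp,                # prompt processing, tokens/s
            "s_tg": tg / t_tg,                # token generation, tokens/s
            "tg_vs_1k": s_tg / baseline_s_tg  # fraction of the 1k-prompt TG speed
        })
    return summary


if __name__ == "__main__":
    print(f"{'PP':>7} {'S_PP':>9} {'S_TG':>7} {'vs 1k':>6}")
    for r in summarize(ROWS):
        print(f"{r['pp']:>7} {r['s_pp']:9.1f} {r['s_tg']:7.1f} {r['tg_vs_1k']:6.0%}")
```

The recomputed values differ from the table in the last digit or two because the printed timings are rounded to milliseconds. The last column makes the trend easier to read: generation at a 128k prompt runs at roughly 59% of the 1k-prompt speed.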