sudo_ls_ads 3 hours ago
Author here. Quick context on what made this worth writing up: Gemma 4 26B A4B is an MoE (26B total params, 4B active per token), which fundamentally changes what's viable on a single consumer GPU. During decode you pay the memory-bandwidth cost of a 4B model but get the quality of a 26B. That's what makes a 5090 a real option for it; a dense 26B wouldn't be.

The interesting part was the quant format choice. NVFP4 is Blackwell's native 4-bit format and theoretically the fastest path, but MoE support for Gemma 4 specifically was blocked on an unmerged vLLM PR (#39045): linear layers loaded, expert weights didn't. Falling back to nightly didn't help either, because that day's nightly was broken by an unconditional pandas import landing in the AITER code path without the image's deps being updated.

I ended up on AWQ + Marlin kernels, which have been stable in vLLM for over a year. For single-user, memory-bandwidth-bound decode the gap to NVFP4 is smaller than you'd expect: both hit the same 4x weight compression, and AWQ dequantizes to FP16 in-register rather than using the FP4 tensor cores. I'm getting ~196 tok/s; I'd estimate NVFP4 would have landed around 220-240 if it had worked.

Happy to dig into the vLLM config, the RunPod Serverless side, or the NVFP4 vs AWQ tradeoff in more depth.