Remix.run Logo
coder543 an hour ago

> meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

You also misunderstand what is happening. Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit. The BF16 QAT is not an upscaled 4-bit model. When quantized to 4-bit, it should lose less accuracy than a typical 16-bit model loses when quantized to 4-bit, but the loss is not zero, because it is not based on a 4-bit model.

The Gemma 3 QAT report was a bit clearer:

https://developers.googleblog.com/en/gemma-3-quantized-aware...

"Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0."

The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.