llmoorator 3 hours ago
You misunderstand what that chart shows: it shows BF16 QAT Q4_0, not regular BF16, meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers. Like storing small 8 bit numbers in full 32 bit integers. So it's not close to 100% of unquantized BF16.

I'm curious if anybody can explain why Google's released 4 bit QAT Q4_0 is not exactly 100% of BF16 QAT Q4_0. It seems like converting between these two packings should be just bit twiddling, with no further quantization. Unsloth talks about "lattice alignment" being an issue.

That being said, I hate that smol model makers like Google, Qwen, ... only show the BF16 benchmarks when they release new models, knowing that what people really run are 4-8 bit quantizations, so it's really hard to understand how much you lose when you run 4 bit vs 6 bit...
coder543 an hour ago
> meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

You also misunderstand what is happening. Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit. The BF16 QAT is not an upscaled 4-bit model. When quantized to 4-bit, it should lose less accuracy than a typical 16-bit model loses when quantized to 4-bit, but the loss is not zero, because it is not based on a 4-bit model.

The Gemma 3 QAT report was a bit clearer: https://developers.googleblog.com/en/gemma-3-quantized-aware...

"Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0."

The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.
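The difference between "full-precision weights trained against simulated quantization" and "4-bit weights stored in a wider format" can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the rounding rule is a simplified symmetric 4-bit scheme with one scale per block of 32 weights, not llama.cpp's exact Q4_0 code and not Google's training setup, and the names `fake_quant`, `latent`, and `on_grid` are hypothetical.

```python
import random

BLOCK = 32  # Q4_0 groups weights into blocks of 32 that share one scale


def fake_quant_block(block):
    """Round one block to a 4-bit grid, then map it back to floats."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return list(block)
    # Assumption: simplified symmetric scale, not llama.cpp's exact rule.
    scale = max_abs / 7
    return [max(-8, min(7, round(w / scale))) * scale for w in block]


def fake_quant(weights):
    """Apply simulated 4-bit quantization block by block."""
    out = []
    for i in range(0, len(weights), BLOCK):
        out.extend(fake_quant_block(weights[i:i + BLOCK]))
    return out


def max_error(a, b):
    """Largest absolute difference between two weight lists."""
    return max(abs(x - y) for x, y in zip(a, b))


random.seed(0)

# Stand-in for BF16 QAT weights: ordinary floats that do not sit on the grid.
latent = [random.gauss(0.0, 0.02) for _ in range(4 * BLOCK)]

# Stand-in for "a 4-bit model stored in a wider format": already on the grid.
on_grid = fake_quant(latent)

err_latent = max_error(latent, on_grid)                 # rounding changes values
err_on_grid = max_error(on_grid, fake_quant(on_grid))   # essentially unchanged

print("error when quantizing off-grid weights:", err_latent)
print("error when re-quantizing on-grid weights:", err_on_grid)
```

In this sketch, only weights that already sit on the 4-bit grid survive quantization unchanged (up to floating-point rounding). Weights that were merely trained with the rounding step in the loop remain ordinary floats, so converting them to 4-bit still loses some information, which is consistent with the nonzero gap described above.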
satvikpendem 2 hours ago
Ah I see, thanks for the clarification.