satvikpendem 3 hours ago
Unsloth's collection as well [0], with their results [1]. Looks like they can get very close to 100% accuracy relative to the unquantized BF16 model, and Unsloth's quants are better than Google's original QAT release as posted in the article. Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API; it works very well for that, even with the model embedded on phones.
llmoorator 3 hours ago
You misunderstand what that chart shows. It shows BF16 QAT Q4_0, not regular BF16, meaning Google quantized the model to 4 bits and stored the result in BF16 format for compatibility and as a convenience to downstream packers. It's like storing small 8-bit numbers in full 32-bit integers. So it's not close to 100% of unquantized BF16.

I'm curious if anybody can explain why Google's released 4-bit QAT Q4_0 is not exactly 100% of BF16 QAT Q4_0? It seems like it should be just bit twiddling, with no further quantization needed to convert between these two packings. Unsloth talks about "lattice alignment" being an issue.

That being said, I hate that smol model makers like Google, Qwen, ... only show the BF16 benchmarks when they release new models, knowing that what people really run are 4-8 bit quantizations, so it's really hard to understand how much you lose when you run 4-bit vs 6-bit...
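One way the repacking could stop being pure bit twiddling is sketched below. This is a simplified toy, not Google's or Unsloth's actual pipeline: it assumes a Q4_0-style scheme where each block's scale is re-derived from its largest-magnitude element (as I understand ggml's reference Q4_0 quantizer to do), ignores the real block size of 32 and the fp16 scale storage, and the function name `q4_0_roundtrip` and the sample values are made up for illustration. If the stored weights sit on a 4-bit grid with step `s`, the round trip is exact only when the re-derived scale happens to equal `s`.

```python
def q4_0_roundtrip(block):
    """Quantize one block to 4-bit codes and dequantize it again (simplified Q4_0 style)."""
    # Take the element with the largest magnitude, keeping its sign.
    extreme = max(block, key=abs)
    # Scale is chosen so that the extreme element maps to code 0 (value -8 * d).
    d = extreme / -8.0
    if d == 0.0:
        return [0.0] * len(block)
    out = []
    for x in block:
        q = min(15, int(x / d + 8.5))  # 4-bit code in 0..15
        out.append((q - 8) * d)        # dequantized value
    return out


# Hypothetical weights on a 4-bit grid with step s = 0.25 (integer levels -8..7).
s = 0.25
aligned = [k * s for k in (-8, -3, 0, 5, 7)]   # block contains level -8
misaligned = [k * s for k in (-3, 0, 5, 7)]    # largest magnitude is level +7

print(q4_0_roundtrip(aligned) == aligned)      # scale re-derived as exactly s
print(q4_0_roundtrip(misaligned))              # scale becomes -7s/8, grid shifts
```

In the second block the re-derived scale is -7s/8 rather than s, so the interior values land on a different grid and no longer match the stored weights. Whether this is what Unsloth means by "lattice alignment" is a guess on my part.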
slopinthebag 3 hours ago
I'm confused: the Unsloth model is ~600 MB and the one from Google is 7 GB?