▲ | iamnotagenius a day ago |
Interesting, but not exactly practical for a local LLM user, as 4-bit is how LLMs are run locally.
▲ | sroussey a day ago | parent | next [-]
True, but their research did include running locally on a 5080. The big takeaway, in my opinion, is that their technique for LUTs etc. could also be applied to lossy quants. Say maybe you get 5-bit accuracy in the size of 4-bit? I don't know, but maybe? Also, their two-stage design might improve current quantized kernel designs.
▲ | gojomo a day ago | parent | prev [-]
Some might prefer the fidelity of this method's 70% savings over the lossiness of 4-bit quantization's 75%. And maybe the methods stack, for those willing to trade both costs for the smallest representation.
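For a concrete sense of that tradeoff, here is back-of-envelope arithmetic comparing the two routes. The 70% and 75% figures come from the comment above; the 70B parameter count and BF16 baseline are just illustrative assumptions:

```python
# Illustrative comparison of storage for a hypothetical 70B-parameter BF16 model.
params = 70e9                 # assumed model size, for illustration only
bf16_bytes = params * 2       # 16 bits = 2 bytes per weight

lossless_bytes = bf16_bytes * (1 - 0.70)   # lossless method: 70% savings
q4_bytes = params * 0.5                    # 4-bit quant: 0.5 bytes/weight, 75% savings

print(f"BF16:     {bf16_bytes / 1e9:.0f} GB")
print(f"lossless: {lossless_bytes / 1e9:.0f} GB")
print(f"4-bit:    {q4_bytes / 1e9:.0f} GB")
```

So under these assumptions the lossless route keeps full fidelity at roughly 42 GB versus roughly 35 GB for lossy 4-bit, which is why some would pay the extra ~7 GB.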