sroussey 2 hours ago
Super interesting!

> - People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.

Where can I find more info on this? I’d like to convert models to ONNX this way.

> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.

Where can I find more info on this? I’d like to convert models to ONNX this way.

The most difficult environment for small models is the browser. It would be great to push the SOTA in that environment.
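For anyone else puzzling over "importance-weighted", here's my rough mental model as a minimal numpy sketch. This is not the actual IQ4 code, and quantize_group, the 0.7–1.1 scale grid, and the random importance vector are all illustrative: the idea is just that the quantizer picks a per-group scale minimizing error weighted by how much each weight matters (e.g. estimated from activation statistics), rather than plain squared error.

    import numpy as np

    def quantize_group(w, imp, n_bits=4, n_scales=64):
        # Importance-weighted round-to-nearest quantization of one
        # weight group: pick the scale minimizing
        # sum(imp * (w - dequant(w))**2), so high-importance weights
        # are reproduced more faithfully than low-importance ones.
        qmax = 2 ** (n_bits - 1) - 1                 # e.g. [-8, 7] for 4-bit
        base = max(np.max(np.abs(w)) / qmax, 1e-12)  # naive max-abs scale
        best_scale, best_err = base, np.inf
        # brute-force a small grid of candidate scales around the naive one
        for s in base * np.linspace(0.7, 1.1, n_scales):
            q = np.clip(np.round(w / s), -qmax - 1, qmax)
            err = np.sum(imp * (w - q * s) ** 2)     # importance-weighted error
            if err < best_err:
                best_scale, best_err = s, err
        q = np.clip(np.round(w / best_scale), -qmax - 1, qmax)
        return q.astype(np.int8), best_scale

    # toy usage: importance could be mean squared activation per channel
    rng = np.random.default_rng(0)
    w = rng.normal(size=256).astype(np.float32)
    imp = rng.uniform(0.1, 1.0, size=256).astype(np.float32)
    q, scale = quantize_group(w, imp)
    print(q[:8], scale)

IIRC this is the flavor of trick behind llama.cpp's imatrix tooling, where the importance weights come from activations collected on a calibration run; the imatrix docs and PRs in that repo are probably the best place to read more.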