tarruda | 7 hours ago
I can't say anything about the OP's method, but I have tested the smol-IQ2_XS quant (2.46 BPW) with the pi harness. I didn't do a very long session because token generation and prompt processing get very slow, but I worked at up to ~70k context and it maintained a lot of coherence throughout. IIRC GPQA diamond is supposed to exercise long chains of thought, and it scored exceptionally well at 82% (the official BF16 number is 88%: https://huggingface.co/Qwen/Qwen3.5-397B-A17B).

Note that not all quants at a given BPW are equal. The smol-IQ2_XS quant I linked is fairly dynamic: some tensors are q8_0, some q6_k, and some q4_k, while the majority are iq2_xs. In my testing, it's the best quant available in this BPW range. Eventually I might try a more practical eval such as terminal bench.
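To make the "dynamic quant" point concrete: the overall BPW is just a parameter-weighted average over the per-tensor types. A rough sketch below uses llama.cpp's nominal bit widths for each type, but the tensor-size fractions are made up for illustration, not the actual smol-IQ2_XS recipe:

```python
# Back-of-the-envelope sketch of how a mixed ("dynamic") quant averages
# out to a low overall BPW. The per-type bit widths are llama.cpp's
# nominal values; the fractions below are hypothetical, NOT the real
# smol-IQ2_XS tensor mix.

BITS_PER_WEIGHT = {
    "iq2_xs": 2.31,    # nominal BPW per llama.cpp
    "q4_k":   4.5,
    "q6_k":   6.5625,
    "q8_0":   8.5,
}

# Hypothetical share of total parameters stored in each type.
mix = {
    "iq2_xs": 0.90,
    "q4_k":   0.06,
    "q6_k":   0.03,
    "q8_0":   0.01,
}

def average_bpw(mix, bits=BITS_PER_WEIGHT):
    """Parameter-weighted average bits per weight for a quant mix."""
    return sum(frac * bits[t] for t, frac in mix.items())

print(f"{average_bpw(mix):.2f} BPW")  # keeping a few tensors at q6_k/q8_0
                                      # only nudges the average up slightly
```

So a quant can keep its attention or output tensors at q6_k/q8_0 precision while still landing in the ~2.5 BPW range, which is why two quants at the same headline BPW can behave very differently.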
Aurornis | 7 hours ago | parent
> I did not do a very long session

This is always the problem with 2-bit and even 3-bit quants: they look promising in short sessions, but then you try to do real work and realize they're a waste of time. In my experience, running a smaller dense model like a 27B produces better results than 2-bit quants of larger models.