Aurornis 3 hours ago
Running larger-than-RAM LLMs is an interesting trick, but it's not practical: token output is extremely slow, and your computer burns a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade quality. With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B instead. From my experience I'd suggest Q5 quantization at most; Q4 works for short responses but gets weird in longer conversations.
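If you want to sanity-check whether a model fits in your RAM before downloading it, a back-of-envelope estimate is just parameter count times bits per weight. The bits-per-weight figures below are rough approximations for common llama.cpp GGUF quant types (actual file sizes vary by quant recipe), and the calculation ignores KV cache and runtime overhead:

```python
# Rough weight-memory estimate for a quantized model.
# Bits-per-weight values are approximations for common GGUF
# quant types, not exact on-disk sizes.
APPROX_BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def model_gb(params_billions: float, quant: str) -> float:
    """Approximate weight memory in GB (excludes KV cache and overhead)."""
    bits = params_billions * 1e9 * APPROX_BPW[quant]
    return bits / 8 / 1e9

for q in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"27B @ {q}: ~{model_gb(27, q):.0f} GB")
```

By this estimate a 27B model at Q5 sits around 19 GB of weights, leaving plenty of headroom for context on a 64GB machine.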
kgeist an hour ago | parent | next
> I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
There are dynamic quants, such as Unsloth's, which quantize only certain layers to Q4: some layers are more sensitive to quantization than others, and smaller models are more sensitive than larger ones. There are also different quantization algorithms with different levels of degradation. So I think it's somewhat misleading to put everything labeled "Q4" under one umbrella. It all depends.
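The point about per-layer quantization can be made concrete with a little arithmetic: a mixed quant's effective size is the parameter-weighted average of each layer group's bits per weight. The split and bpw numbers below are made up for illustration, not taken from any actual Unsloth recipe:

```python
# Illustrative only: effective bits-per-weight of a mixed
# ("dynamic") quant. Layer split and bpw values are hypothetical.
def blended_bpw(layers):
    """layers: list of (param_count, bits_per_weight) tuples."""
    total_bits = sum(n * bpw for n, bpw in layers)
    total_params = sum(n for n, _ in layers)
    return total_bits / total_params

# Hypothetical 7B model: 1.5B sensitive params kept at ~6.6 bpw,
# the remaining 5.5B quantized down to ~4.5 bpw.
mix = [(1.5e9, 6.6), (5.5e9, 4.5)]
print(f"average: {blended_bpw(mix):.2f} bpw")  # ~4.95 bpw
```

So a "Q4" dynamic quant can land near Q5 in effective precision while keeping most of the size savings, which is why the label alone doesn't tell you much.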
freedomben 3 hours ago | parent | prev
I've run a number of experiments and agree completely. If the model doesn't fit in RAM, it's so slow as to be practically useless. If you're running jobs overnight, then maybe, but expect to wait a very long time for any answer.