egorfine 4 hours ago
Related: I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (Q4) runs at about 8x the t/s on the M5 Pro and loads from disk to memory 2x faster. Gonna run some more tests later today.
Confiks 3 hours ago
> The same Gemma 4 MoE model (Q4)

Since you have so much RAM, I'd suggest running Q8_0 directly. It's not slower (except perhaps for the initial model load), and might even be faster, while being almost identical in quality to the original model.

And just to be sure: you are running the MLX version, right? The mlx-community quantization seemed to be broken when I tried it last week (it spat out garbage), so I downloaded the unsloth version instead. That too was broken in mlx-lm (it crashed), but it has since been fixed on the main branch of https://github.com/ml-explore/mlx-lm.

I unfortunately only have 16 GiB of RAM on a MacBook M1, but I just tried running the Q8_0 GGUF version on a 2023 AMD Framework 13 with 64 GiB of RAM, CPU only, and that works surprisingly well: tokens/s much faster than I can read the output. The prompt cache is also very useful for quickly inserting a large system prompt or a file to datamine, although there are probably better ways to do that than manually through a script.
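For a rough sense of why Q8_0 becomes viable at 48GB, here's a back-of-envelope sketch of weight memory at the two quantization levels. The block sizes are the GGUF Q4_0/Q8_0 layouts from llama.cpp; the 27B parameter count is just an illustrative assumption, not the model from this thread:

```python
# Back-of-envelope RAM estimate for GGUF-quantized weights.
# GGUF block layouts: Q4_0 packs 32 weights into 18 bytes (4.5 bits/weight
# including the per-block scale); Q8_0 packs 32 weights into 34 bytes
# (8.5 bits/weight). Activations and KV cache come on top of this.

def gguf_weight_gib(params_billion: float, bytes_per_block: int,
                    weights_per_block: int = 32) -> float:
    """Approximate size in GiB of the quantized weight tensors alone."""
    total_bytes = params_billion * 1e9 * bytes_per_block / weights_per_block
    return total_bytes / 2**30

# Hypothetical 27B-parameter model (assumed size for illustration):
q4 = gguf_weight_gib(27, 18)  # Q4_0
q8 = gguf_weight_gib(27, 34)  # Q8_0
print(f"Q4_0 ≈ {q4:.1f} GiB, Q8_0 ≈ {q8:.1f} GiB")
```

So for a model in that class, Q8_0 roughly doubles the weight footprint versus Q4_0 but still fits comfortably in 48GB, while on a 24GB machine it would not.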