cafkafk 3 hours ago
Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, and by mainstream tools not prioritizing this while hiding all the performance levers. I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow. I've also linked the quants at the end, but they won't run unless you use the ik_llama-cpp fork I mention; see the other posts for more details. I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I can try to answer on a best-effort basis.
fragmede 2 hours ago
(Purple on black is really hard to read.) You say it runs "at reading speed". Have you benchmarked it?
arpinum 6 minutes ago
[dead]