▲ | ethan_smith 9 days ago | |
For Intel CPUs, Phi-2 (2.7B) and TinyLlama (1.1B) run reasonably well using llama.cpp with 4-bit quantization. 4-bit GGUF models typically need roughly 0.5-0.7 GB of RAM per billion parameters for the weights, plus some overhead for context (the ~2 GB per billion figure applies to FP16), so even older machines can handle smaller models. | ||
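As a rough sanity check on RAM figures like these, here is a back-of-envelope sketch. The helper name `weight_memory_gb` is made up for illustration, and it counts the weights only. KV cache and runtime overhead are not included, and real GGUF 4-bit quant types use somewhat more than 4 bits per weight on average, so treat the output as a lower bound.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    Hypothetical helper: ignores KV cache, activations, and runtime overhead.
    """
    # billions of params * bits per weight / 8 bits per byte = GB
    return params_billion * bits_per_weight / 8


for name, size_b in [("TinyLlama", 1.1), ("Phi-2", 2.7)]:
    q4 = weight_memory_gb(size_b, 4)
    fp16 = weight_memory_gb(size_b, 16)
    print(f"{name} ({size_b}B): ~{q4:.2f} GB at 4-bit, ~{fp16:.2f} GB at FP16")
```

By this estimate, both models fit comfortably in 4 GB of RAM at 4-bit, with room left over for context.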
▲ | akawry 8 days ago | parent [-] | |
Take a look at ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp Its CPU performance is much better than mainline llama.cpp, and it has more quantization types available. |