Half-OT: Anything useful that runs reasonably fast on a regular Intel CPU/GPU?

I did a bunch of research and the answer is basically no, unless you can live with sending a request in the evening and getting the result in the morning.

And you'd need a lot of regular RAM, because otherwise you start swapping, at which point I think response times stretch to days.

This tech is still in its Wild West days. For it to be usable by the average person on consumer hardware, I think we'll need to wait until 2030 or later.

ethan_smith 9 days ago

For Intel CPUs, Phi-2 (2.7B) and TinyLlama (1.1B) run reasonably well using llama.cpp with 4-bit quantization. GGUF models with 4-bit quantization typically need roughly 0.5 to 0.7 GB of RAM per billion parameters for the weights, plus some extra for the context, so even older machines can handle smaller models.
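
If you want to sanity-check whether a model fits in RAM, the back-of-the-envelope math is just parameters times bits per weight. Here is a rough sketch of that; the function name is made up for illustration, and it deliberately ignores the KV cache, runtime overhead, and the fact that real GGUF quant types usually average a bit more than 4 bits per weight, so treat the numbers as a lower bound.

```python
def estimate_weight_ram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate RAM for the model weights alone, in GB (10^9 bytes).

    Hypothetical helper for illustration: it ignores the KV cache,
    activations, and any per-block metadata the quant format stores.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Parameter counts taken from the comment above.
    for name, size_b in [("TinyLlama", 1.1), ("Phi-2", 2.7)]:
        gb = estimate_weight_ram_gb(size_b, bits_per_weight=4.0)
        print(f"{name} ({size_b}B): at least ~{gb:.2f} GB for weights at 4 bits/weight")
```

Plug in 16 bits per weight instead of 4 to see why unquantized models get painful on ordinary desktops.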

	akawry 8 days ago
		Take a look at ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp Its CPU performance is much better than mainline llama.cpp, and it has more quantization types available.