veselin 6 hours ago
I guess we'd see a lot more benefit if we could get this working on something like llama.cpp: it has a lot of kernels for different quantizations, a lot of home users, and high hardware diversity, so it's likely the place with the highest bang for the buck. I guess they could become a contributor there.
LuxBennu 3 hours ago | parent
This is the right call. llama.cpp has dozens of hand-tuned CUDA kernels across Q4_K_M, Q5_K_S, Q8_0, and other quant formats, each targeting different hardware profiles. An autoresearch approach that could optimize these per-GPU would be huge. Right now, performance varies wildly between, say, an RTX 3090 and a 5070 Ti on the same quant format, because the kernels are tuned for specific architectures. The hardware diversity of the llama.cpp user base is exactly where automated kernel search has the most to gain.
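The core loop of such a search is simple to sketch: benchmark every candidate variant on the machine it will actually run on, then keep the winner. A minimal illustration in Python, with toy "kernels" standing in for real CUDA variants (the function names here are made up for the example, not llama.cpp or any real autotuner API):

```python
import time

def autotune(candidates, args, repeats=5):
    """Pick the fastest candidate implementation on the local hardware.

    Stand-in for per-GPU kernel search: time each variant in place
    and select the winner, instead of shipping one kernel tuned for
    a single architecture.
    """
    best_fn, best_time = None, float("inf")
    for fn in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed
    return best_fn

# Two toy "kernels" computing the same dot product different ways.
def dot_naive(a, b):
    return sum(x * y for x, y in zip(a, b))

def dot_unrolled(a, b):
    # Manually unrolled by two, the kind of variation real kernel
    # search explores (tile sizes, unroll factors, layouts, ...).
    s = 0.0
    for i in range(0, len(a) - 1, 2):
        s += a[i] * b[i] + a[i + 1] * b[i + 1]
    if len(a) % 2:
        s += a[-1] * b[-1]
    return s

a = [1.0] * 1024
b = [2.0] * 1024
winner = autotune([dot_naive, dot_unrolled], (a, b))
print(winner(a, b))  # both variants agree on the result: 2048.0
```

A real search over CUDA kernels would additionally cache the winner per (GPU, quant format) pair and verify numerical equivalence before swapping variants, but the benchmark-and-select skeleton is the same.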