summarity 9 days ago:
It literally is. LLM inference is almost entirely memory bound. In fact, for naive inference (no batching), you can calculate the token throughput from just the model size, context size, and memory bandwidth.
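
As a back-of-the-envelope sketch of that calculation (the model shape, quantization, and bandwidth figures below are illustrative assumptions, not numbers from the comment):

    # Naive (batch-1) decode: each generated token streams the whole model
    # plus the KV cache from memory once, so bandwidth sets the ceiling.
    n_params    = 7e9                 # assumed 7B-parameter model
    model_bytes = n_params * 0.5      # ~4-bit quantization -> 0.5 bytes/weight

    # KV cache for a Llama-7B-like shape: 2 (K and V) * layers * hidden * fp16 * context
    n_layers, d_model, ctx_len = 32, 4096, 2048
    kv_bytes = 2 * n_layers * d_model * 2 * ctx_len

    bandwidth = 50e9                  # ~50 GB/s effective dual-channel DDR5 (assumed)

    tokens_per_s = bandwidth / (model_bytes + kv_bytes)
    print(f"~{tokens_per_s:.1f} tokens/s upper bound")   # ~11 tokens/s

Measured batch-1 decode speed on a setup like that tends to land close to this bound, which is what makes the "memory bound" claim easy to sanity-check.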
zozbot234 9 days ago | parent:
Prompt pre-processing (before the first token is output) is purely compute-bound. That's why it would be nice if we could direct llama.cpp/ollama to run that phase only on the iGPU/NPU (on systems without a separate dGPU, obviously) and then shift the whole thing over to CPU inference for the subsequent token-generation phase. (A memory-bound workload like token generation wouldn't usually run into the CPU's thermal or power limits, so there would be little or no gain from offloading work to the iGPU/NPU in that phase.)
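
A rough way to see why the two phases land on opposite sides of the roofline (assumed 7B model and 512-token prompt, illustrative only): prefill pushes the whole prompt through the weights in one pass, so each byte of weights read does thousands of FLOPs, while decode does only a few FLOPs per byte.

    # Arithmetic intensity (FLOPs per byte of weights read); illustrative numbers.
    n_params        = 7e9
    model_bytes     = n_params * 0.5       # assumed ~4-bit weights
    flops_per_token = 2 * n_params         # ~2 FLOPs per weight per token

    prompt_len = 512                       # prefill processes all prompt tokens together
    prefill_intensity = prompt_len * flops_per_token / model_bytes
    decode_intensity  = 1 * flops_per_token / model_bytes

    print(f"prefill: {prefill_intensity:.0f} FLOPs/byte")   # ~2048 -> compute-bound
    print(f"decode:  {decode_intensity:.0f} FLOPs/byte")    # ~4    -> memory-bound

That's the case for routing prefill to whatever has the most raw compute (the iGPU/NPU here) and leaving decode wherever the memory bandwidth already lives.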