MereInterest (9 days ago):
Nitpick: O(n^2) is quadratic, not exponential. For it to “increase exponentially”, n would need to be in the exponent, such as O(2^n).
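A tiny illustration of the difference, with arbitrary values: n^2 grows polynomially, while 2^n has n in the exponent and quickly dwarfs it.

    # Quadratic vs. exponential growth (illustrative values only).
    for n in (10, 20, 40):
        print(n, n**2, 2**n)
    # 10 100 1024
    # 20 400 1048576
    # 40 1600 1099511627776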

zozbot234 (9 days ago):
Typically, the token-generation phase is memory-bound for LLM inference in general, and this becomes especially clear as context length increases (since the model's parameters are a fixed quantity). If it were purely compute-bound, there would be huge gains to be had by shifting some of the load to the NPU (ANE), but AIUI it's just not so.

summarity (9 days ago):
It literally is. LLM inference is almost entirely memory-bound. In fact, for naive inference (no batching), you can calculate the token throughput just from the model size, context size, and memory bandwidth.
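A rough back-of-the-envelope sketch of that estimate; the model size, KV-cache size, and bandwidth below are assumed numbers for illustration, not figures from the thread.

    # Memory-bound decode: each generated token streams the weights (plus the
    # KV cache) from memory, so throughput ~ bandwidth / bytes read per token.
    def naive_tokens_per_second(model_bytes, kv_cache_bytes, mem_bw_bytes_per_s):
        bytes_read_per_token = model_bytes + kv_cache_bytes
        return mem_bw_bytes_per_s / bytes_read_per_token

    # Assumed example: ~7B model at 4-bit (~3.5 GB weights), 1 GB KV cache,
    # 100 GB/s memory bandwidth -> roughly 22 tokens/s.
    print(naive_tokens_per_second(3.5e9, 1.0e9, 100e9))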

zozbot234 (9 days ago):
Prompt pre-processing (before the first token is output) is raw compute-bound. That's why it would be nice if we could direct llama.cpp/ollama to run that phase only on the iGPU/NPU (for systems without a separate dGPU, obviously) and shift the whole thing over to CPU inference for the later token-generation phase. (A memory-bound workload like token generation wouldn't usually run into the CPU's thermal or power limits, so there would be little or no gain from offloading work to the iGPU/NPU in that phase.)
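A sketch of why the two phases land on opposite sides of the compute/memory boundary, under the common approximation that a dense transformer costs roughly 2 × n_params FLOPs per token; all numbers below are assumptions.

    # Arithmetic intensity (FLOPs per byte of weights read): prefill vs. decode.
    N_PARAMS = 7e9        # assumed parameter count
    BYTES_PER_PARAM = 2   # assumed fp16 weights
    PROMPT_TOKENS = 2048  # prompt length handled in one batched prefill pass

    weight_bytes = N_PARAMS * BYTES_PER_PARAM

    # Prefill: one pass over the weights serves the whole prompt -> high
    # FLOPs/byte -> compute-bound.
    prefill_flops_per_byte = (2 * N_PARAMS * PROMPT_TOKENS) / weight_bytes  # ~2048

    # Decode: one pass over the weights per generated token -> low
    # FLOPs/byte -> memory-bound.
    decode_flops_per_byte = (2 * N_PARAMS * 1) / weight_bytes  # ~1

    print(prefill_flops_per_byte, decode_flops_per_byte)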