VierScar 9 days ago
No, I don't think it's the bits. I would say it's the computation. Inference requires performing a lot of matmul, and with more tokens the number of computation operations increases exponentially - O(n^2) at least. So increasing your context/conversation will quickly degrade performance. I seriously doubt it's the throughput of memory during inference that's the bottleneck here.
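(A rough sketch of the scaling being described: with standard self-attention, the score and value matmuls for processing a prompt grow with the square of the context length. The model dimensions and the attention_flops_prefill helper below are illustrative assumptions, not figures from this thread.)

    # Back-of-envelope: attention FLOPs for processing a prompt of n_ctx tokens.
    # QK^T and attention*V each cost roughly 2 * n_ctx^2 * d_model multiply-adds
    # per layer, so prefill attention work grows quadratically in n_ctx.
    def attention_flops_prefill(n_ctx, d_model=4096, n_layers=32):
        return 2 * 2 * n_ctx ** 2 * d_model * n_layers

    for n in (1_000, 4_000, 16_000):
        print(f"context {n:>6}: ~{attention_flops_prefill(n):.2e} attention FLOPs")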
MereInterest 9 days ago | parent
Nitpick: O(n^2) is quadratic, not exponential. For it to “increase exponentially”, n would need to be in the exponent, such as O(2^n). | ||||||||
zozbot234 9 days ago | parent
The token generation phase of LLM inference is typically memory-bound, and this becomes especially clear as context length increases (since the model's parameters are a fixed quantity). If it were purely compute-bound, there would be huge gains to be had by shifting some of the load to the NPU (ANE), but AIUI that's just not so.
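(A sketch of why longer contexts make the memory-bound behaviour more pronounced: each generated token has to stream both the fixed weights and the growing KV cache from memory. The model shape, dtype sizes, and the bytes_per_token helper are illustrative assumptions.)

    # Bytes streamed from memory per generated token, assuming every weight and
    # the full KV cache are read once per token (no batching, no reuse tricks).
    def bytes_per_token(n_params, n_ctx, n_layers=32, d_model=4096,
                        bytes_per_weight=2, bytes_per_kv=2):
        weight_bytes = n_params * bytes_per_weight
        kv_bytes = 2 * n_layers * n_ctx * d_model * bytes_per_kv  # K and V caches
        return weight_bytes + kv_bytes

    for ctx in (1_000, 8_000, 32_000):
        print(f"context {ctx:>6}: ~{bytes_per_token(7e9, ctx) / 1e9:.1f} GB per token")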
summarity 9 days ago | parent
It literally is. LLM inference is almost entirely memory-bound. In fact, for naive inference (no batching), you can calculate the token throughput just from the model size, context size, and memory bandwidth.
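(A minimal version of that calculation, assuming decode speed is limited purely by how fast the weights and KV cache can be read; the bandwidth and size figures are made-up examples.)

    # Naive roofline estimate for single-stream decoding:
    # tokens/s ~= memory bandwidth / bytes read per token.
    def tokens_per_second(bandwidth_gb_s, model_gb, kv_cache_gb):
        return bandwidth_gb_s / (model_gb + kv_cache_gb)

    # e.g. an 8-bit ~7B model (~7 GB) plus a ~2 GB KV cache on a 400 GB/s machine:
    print(f"~{tokens_per_second(400, 7.0, 2.0):.0f} tokens/s")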