martinald | 5 days ago
Thanks for the correction (author here). I'll update the article - very fair point on compute for input tokens, which I messed up. Tbh I'm pleased my napkin math was only 7x off the laws of physics :). Even rerunning the math on my use cases with a much higher input token cost doesn't change much, though.
chillee | 5 days ago | parent
The choice of 32 parallel sequences is also arbitrary and significantly changes your conclusions. For example, if they run with 256 parallel sequences, that would make your calculations an 8x cheaper factor for both prefill and decode. The claim that long context lengths are required for attention to be compute-bound is also quite misleading.
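
To make chillee's batch-size arithmetic concrete, here is a minimal sketch in Python. It assumes decode is memory-bandwidth bound (the weights are read from HBM once per step regardless of batch size, so that cost amortizes across all sequences in the batch). The model size and bandwidth figures are illustrative assumptions, not numbers from the article: a hypothetical 70B-parameter model in bf16 on roughly H100-class HBM. The function name is mine.

    # Napkin math: per-sequence decode cost vs. number of parallel sequences.
    # Assumes decode is memory-bandwidth bound: weights are read from HBM once
    # per decode step regardless of batch size, so that cost amortizes across
    # all sequences in the batch. Numbers are illustrative, not measured.

    WEIGHT_BYTES = 2 * 70e9    # hypothetical 70B-param model in bf16
    HBM_BYTES_PER_S = 3.35e12  # ~H100 SXM HBM3 bandwidth

    def decode_seconds_per_token(parallel_seqs: int) -> float:
        """Seconds to produce one token for one sequence in a decode step.

        One pass over the weights serves every sequence in the batch, so
        per-sequence cost falls as 1 / parallel_seqs while the step remains
        bandwidth bound.
        """
        step_seconds = WEIGHT_BYTES / HBM_BYTES_PER_S
        return step_seconds / parallel_seqs

    for n in (32, 256):
        ms = decode_seconds_per_token(n) * 1e3
        print(f"{n:>4} parallel seqs: {ms:.3f} ms per token per sequence")

    # 32 seqs -> ~1.31 ms/token/seq; 256 seqs -> ~0.16 ms/token/seq:
    # the 8x factor. Past some batch size the step becomes compute bound
    # (FLOPs grow with batch and context) and this scaling stops.

The point of the sketch is only the scaling: per-token cost estimates move linearly with the assumed number of parallel sequences, so picking 32 vs 256 shifts the whole cost conclusion by 8x.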