Calculating GPT-2's Inference Speedups (njkumar.com)
2 points by njkumarr 14 hours ago | 2 comments
p1esk 13 hours ago

Good post, thank you!

> On an A100 80GB we get 312 teraflops of float16 compute and 1.5 TB/s of memory bandwidth, and this ratio comes out to roughly 208 tokens.
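
For anyone checking the arithmetic, that ratio is just spec-sheet peak compute divided by peak bandwidth; a quick sketch using the figures quoted above:

    # A100 80GB spec-sheet numbers quoted above
    peak_flops = 312e12           # float16 tensor-core FLOP/s
    mem_bandwidth = 1.5e12        # HBM bytes/s
    print(peak_flops / mem_bandwidth)  # ~208 FLOPs per byte moved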

A few thoughts:

1. One token != one byte

2. Your prompt ("Edgar Allan Poe is a") is short (<< 300 tokens)

3. Both the FLOPS and memory-bandwidth figures for the A100 are theoretical maximums. Reality is usually very different and workload-dependent; a rough way to check is sketched below.
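
On point 3, one way to see the gap is to time a large float16 matmul and compare achieved throughput against the 312 TFLOPS figure. A minimal sketch; the exact result varies with matrix shape, clocks, and thermals:

    import torch

    n = 8192
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)

    for _ in range(3):  # warmup so cuBLAS settles on a kernel
        a @ b
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    iters = 10
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()

    secs = start.elapsed_time(end) / 1e3   # elapsed_time() returns milliseconds
    flops = 2 * n**3 * iters               # 2*M*N*K FLOPs per matmul
    print(f"{flops / secs / 1e12:.0f} TFLOPS achieved")

Even this near-best-case workload lands below the datasheet peak, and a real decode step (small batch, memory-bound) lands far lower.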

njkumarr 7 hours ago

Thank you for taking the time to read my article!

For your 2nd point, to clarify: I actually generate 300 new tokens on top of that initial prompt, not just run the short prompt by itself, so prompt prefill + token generation should come out to about 306 tokens.
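
As a quick sanity check on that count, assuming the standard GPT-2 BPE tokenizer via Hugging Face transformers (not necessarily the exact setup from the article):

    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    prompt_ids = tok("Edgar Allan Poe is a")["input_ids"]
    print(len(prompt_ids))        # a handful of prompt tokens (~6)
    print(len(prompt_ids) + 300)  # ~306 tokens after generating 300 more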

For your 1st and 3rd points, you're definitely correct. Looking back, I probably should have used the torch profiler to track where CPU overhead started to decrease, in order to better identify the compute-bound regions of my workload, rather than doing napkin math on A100 spec-sheet numbers.
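
For what it's worth, a minimal torch.profiler sketch along those lines; the model and input here are hypothetical stand-ins for the article's actual setup:

    import torch
    from torch.profiler import profile, ProfilerActivity
    from transformers import GPT2LMHeadModel

    # Hypothetical stand-ins for the article's model and inputs
    model = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval()
    x = torch.randint(0, 50257, (1, 306), device="cuda")  # dummy token ids

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(x)

    # CPU time dwarfing CUDA time per op suggests kernel-launch overhead
    # (overhead-bound) rather than a compute-bound region.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))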