p1esk 15 hours ago
Good post, thank you! On an A100 80GB we get 312 teraFLOPS of float16 compute and 1.5 TB/s of memory bandwidth, and that ratio comes out to roughly 208 tokens (quick sketch below). A few thoughts:

1. One token != one byte.

2. Your prompt ("Edgar Allan Poe is a") is short (<<300 tokens).

3. Both the FLOPS and memory bandwidth figures for the A100 are theoretical maximums. Reality is usually very different and is workload dependent.
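For reference, here's the napkin math in question as a tiny sketch, using only the theoretical peak spec numbers above:

    # Back-of-the-envelope: the FLOPs-to-bytes ratio for an A100 80GB,
    # using theoretical peak specs.
    peak_fp16_flops = 312e12   # 312 teraFLOPS, dense float16
    mem_bandwidth   = 1.5e12   # 1.5 TB/s HBM bandwidth

    # Ops available per byte moved: ~208. Reading this as a token
    # count assumes one byte per parameter per token, which is exactly
    # what point 1 above disputes.
    print(peak_fp16_flops / mem_bandwidth)  # ~208.0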
njkumarr 10 hours ago
Thank you for taking the time to read my article! On your 2nd point, to clarify: I actually generate 300 new tokens on top of that initial prompt, not just run the short prompt by itself, so with precomputation of the prompt plus token generation it should come out to about 306 tokens. On your 1st and 3rd points you are definitely correct. Looking back, I probably should have focused on using the torch profiler to track at what point the CPU overhead started to decrease, to better identify the compute-bound regions of my workload, rather than on napkin math from A100 specs. Something like the sketch below is what I had in mind.
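A minimal torch.profiler sketch, assuming a CUDA GPU is available; the stand-in model is just a stack of large matmuls, not the actual model from the article:

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Stand-in workload: big linear layers, roughly what dominates
    # transformer decode. Placeholder for the article's setup.
    model = torch.nn.Sequential(
        *[torch.nn.Linear(4096, 4096) for _ in range(8)]
    ).half().cuda()
    x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")

    # Record both CPU and CUDA activity; comparing per-op CPU time
    # (launch overhead) against CUDA time shows where the workload
    # stops being overhead-bound and becomes compute-bound.
    with profile(activities=[ProfilerActivity.CPU,
                             ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            for _ in range(50):
                model(x)

    print(prof.key_averages().table(sort_by="cuda_time_total",
                                    row_limit=10))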