tosh 5 hours ago
re comments: yes, of course this is apples to oranges, but that's kind of the point.

It shows the vast span between specialized hardware throughput, IFF you can use an A100 at its limit, vs the overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU.

The interesting thing is why that is so: CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …
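To make the "boxing vs dense representation" point concrete, here is a rough sketch that assumes the language in question is Python running on CPython. It compares the memory footprint of a list of boxed floats with a packed array of doubles. The variable names are illustrative only, and the exact byte counts depend on the interpreter build.

```python
import sys
from array import array

n = 1000

# Boxed: a list holds pointers, each pointing to a separate float object.
boxed = [float(i) for i in range(n)]
# Dense: one contiguous buffer of raw 8-byte doubles.
dense = array("d", range(n))

# Count the list's pointer storage plus every float object it references.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
# Payload only: element size times element count.
dense_bytes = dense.itemsize * len(dense)

print(f"boxed list : {boxed_bytes} bytes")
print(f"dense array: {dense_bytes} bytes")
print(f"overhead   : {boxed_bytes / dense_bytes:.1f}x")
```

Memory is only one part of it: each boxed element also costs a pointer dereference and a type check per operation, which is where the interpreter overhead and scalar execution come in.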
p1esk 5 hours ago
A100 FP32 throughput “at its limit”: 19.5 TFLOP/s. AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).
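The arithmetic behind the EPYC figure can be checked with a small sketch. It takes the core count, FLOP per cycle per core, and clock speed exactly as quoted above, and does not verify them against a spec sheet. `peak_tflops` is a hypothetical helper name.

```python
def peak_tflops(cores: int, flop_per_cycle_per_core: int, clock_ghz: float) -> float:
    """Theoretical peak = cores x FLOP/cycle/core x clock (GHz), converted to TFLOP/s."""
    gflops = cores * flop_per_cycle_per_core * clock_ghz
    return gflops / 1000.0

# Figures as quoted in the comment above.
epyc_9965 = peak_tflops(cores=192, flop_per_cycle_per_core=64, clock_ghz=3.35)
a100_fp32 = 19.5  # TFLOP/s, as quoted

print(f"EPYC 9965 FP32 peak: {epyc_9965:.1f} TFLOP/s")  # 41.2
print(f"Ratio vs A100 FP32 : {epyc_9965 / a100_fp32:.2f}x")
```

Both numbers are theoretical peaks, so the ratio only says something about the ceiling, not about what a given workload reaches on either chip.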