| ▲ | cdavid 7 hours ago | |
Not really. The GPU's many cores, at least for fp32, give you 2 to 4 orders of magnitude over a high-speed CPU. The rest comes from going from "Python float" (i.e. not NumPy) to C, which already gives you 2 to 3 orders of magnitude, and then another 2 to 3 from plain C to optimized SIMD. See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 orders of magnitude just from C.
| ▲ | p1esk 5 hours ago | parent [-] | |
Theoretical FP32 performance of AMD EPYC 9965 is double that of A100: 41.2 TFLOP/s vs 19.5 TFLOP/s.
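
For anyone wondering where numbers like these come from, here is a rough back-of-the-envelope sketch. Peak FLOP/s is just (execution units) x (FLOPs per unit per cycle) x (clock). The per-chip inputs below are my assumptions, not something stated in the comment: 192 cores with two 512-bit FMA pipes each at a 3.35 GHz all-core boost for the EPYC 9965, and 6912 FP32 CUDA cores at a 1.41 GHz boost for the A100 (no tensor cores). The helper name `peak_fp32_tflops` is made up for illustration.

```python
def peak_fp32_tflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Theoretical peak in TFLOP/s: units * FLOPs/cycle/unit * GHz / 1000."""
    return units * flops_per_unit_per_cycle * clock_ghz / 1000.0


# EPYC 9965 (assumed): 192 cores, each with 2 FMA pipes x 16 fp32 lanes
# (512-bit vectors) x 2 FLOPs per FMA, at an assumed 3.35 GHz.
epyc_tflops = peak_fp32_tflops(units=192, flops_per_unit_per_cycle=2 * 16 * 2,
                               clock_ghz=3.35)

# A100 (assumed): 6912 FP32 CUDA cores, 1 FMA = 2 FLOPs per cycle each,
# at an assumed 1.41 GHz boost clock.
a100_tflops = peak_fp32_tflops(units=6912, flops_per_unit_per_cycle=2,
                               clock_ghz=1.41)

print(f"EPYC 9965: {epyc_tflops:.1f} TFLOP/s")
print(f"A100:      {a100_tflops:.1f} TFLOP/s")
print(f"ratio:     {epyc_tflops / a100_tflops:.2f}x")
```

Under those assumptions the figures round to the 41.2 and 19.5 TFLOP/s quoted above. These are paper peaks only; sustained throughput depends on memory bandwidth, clocks under load, and how well the kernel is vectorized.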