Remix clone Hacker News

	▲	cdavid 7 hours ago
		Not really. A GPU's many cores, at least for fp32, give you 2 to 4 orders of magnitude over a high-speed CPU. The rest comes from going from "Python float" (e.g. not NumPy) to C, which already gives you 2 to 3 orders of magnitude, and then another 2 to 3 from plain C to optimized SIMD. See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 orders of magnitude just from C.
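		To put a number on the "Python float" baseline mentioned above, here is a minimal sketch that times a naive triple-loop matrix multiply in pure Python and reports GFLOP/s. The function names `matmul_naive` and `measure_gflops` and the matrix size are illustrative choices, not something from the comment or the linked repo, and the result will vary a lot by machine and interpreter.

```python
import time

def matmul_naive(a, b):
    """Multiply two matrices stored as lists of lists of Python floats."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c

def measure_gflops(n=64):
    """Time one n x n multiply and return the achieved GFLOP/s.

    A matrix multiply does n^3 multiply-adds, counted as 2 * n^3 FLOPs.
    The default n=64 is an arbitrary small size so the loop finishes quickly.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    matmul_naive(a, b)
    elapsed = time.perf_counter() - start
    return 2 * n**3 / elapsed / 1e9

if __name__ == "__main__":
    print(f"pure Python: {measure_gflops():.4f} GFLOP/s")
```

		Comparing that figure against a plain C loop and then an optimized SIMD kernel on the same machine is one way to check the per-step orders of magnitude for yourself.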
	▲	p1esk 5 hours ago | parent
		Theoretical FP32 performance of AMD EPYC 9965 is double that of A100: 41.2 TFLOP/s vs 19.5 TFLOP/s
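		For anyone wondering where a theoretical peak like that comes from, it is usually just execution units times FLOPs per unit per cycle times clock speed. A rough sketch, assuming the A100's published 6912 FP32 CUDA cores, a 1.41 GHz boost clock, and one fused multiply-add (2 FLOPs) per core per cycle; those inputs are my assumptions, not figures from the comment, and `peak_tflops` is a hypothetical helper name.

```python
def peak_tflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Theoretical peak in TFLOP/s: units * FLOPs/cycle * GHz / 1000."""
    return units * flops_per_unit_per_cycle * clock_ghz / 1000.0

# Assumed A100 inputs (not from the thread): 6912 FP32 cores,
# 2 FLOPs per core per cycle (one FMA), 1.41 GHz boost clock.
a100 = peak_tflops(6912, 2, 1.41)

# Ratio of the two peak figures quoted in the comment above.
ratio = 41.2 / 19.5

if __name__ == "__main__":
    print(f"A100 peak: {a100:.2f} TFLOP/s")
    print(f"EPYC 9965 / A100: {ratio:.2f}x")
```

		These are best-case numbers; what a real workload sustains depends on memory bandwidth, clocks under load, and how well the code vectorizes.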