re comments:

yes of course this is apples to oranges but that's kind of the point

it shows the vast span between specialized hardware throughput IFF you can use an A100 at its limit vs overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU

the interesting thing is why that is so

CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …

▲

p1esk 5 hours ago | parent [-]

A100 FP32 throughput “at its limit”: 19.5 TFLOP/s.

AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).

▲

zzzoom 14 minutes ago | parent | next [-]

EPYC 9965: 614GBps of 12-channel DDR5-6400

A100: 1935GBps of HBM2e

Most of those FLOPS are constrained by memory bandwidth.

▲

an hour ago | parent | prev | next [-]

[deleted]

▲

tosh 5 hours ago | parent | prev [-]

A100: 312 TFLOP/s for FP16

but it is very impressive how far modern CPUs get as well (also in smart phones!)

▲

p1esk 4 hours ago | parent [-]

Intel Xeon 6980P: 128 cores x 1024 FP16 FLOP/cycle/core x 3.2 GHz: 419 TFLOP/s

▲

tosh 3 hours ago | parent [-]

I'm not saying "GPU more brrt than CPU"

I found the comparison interesting

on Intel Xeon 690P with 419 TFLOP/s it is still (maybe even more?) interesting to ask:

how much throughput can you reach with Python, Python with lib x, y, z, with C++ like this, with C++ like that etc etc and why?

no?

	▲	p1esk 3 hours ago \| parent [-]
		No one in their right mind would use pure Python to do matrix multiplication. It’s like using a screwdriver to hammer nails into wood. But this discussion is even more bizarre than comparing a screwdriver to a hammer, it’s like comparing a screwdriver to a nail.