corysama 5 days ago

Nvidia’s marketing team uses confusing terminology to make their product sound cooler than it is.

An Intel “core” can execute AVX-512 SIMD instructions that operate on 16 lanes of 32-bit data. Intel cores are packaged in groups of up to 16. And they use hyperthreading, speculative execution, and shadow registers to cover latency.
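For concreteness, here is a minimal sketch of that 16-lane width: one AVX-512 instruction performing 16 single-precision adds at once (assuming an AVX-512F-capable CPU; build with something like gcc -mavx512f):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m512 a = _mm512_set1_ps(1.0f); // broadcast 1.0f into all 16 lanes
        __m512 b = _mm512_set1_ps(2.0f); // broadcast 2.0f into all 16 lanes
        __m512 c = _mm512_add_ps(a, b);  // ONE instruction, 16 parallel adds

        float out[16];
        _mm512_storeu_ps(out, c);        // store all 16 lanes back to memory
        printf("%f\n", out[0]);          // prints 3.000000
        return 0;
    }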

An Nvidia “Streaming Multiprocessor” can perform SIMD instructions on 32 lanes of 32 bits each. Nvidia calls these lanes “cores” to make it feel like one GPU can compete with thousands of Intel CPUs.

Simpler terminology would be: an Nvidia H100 has 114 SM cores, each with four 32-wide SIMD execution units (where basic instructions have a latency of 4 cycles) and four Tensor Cores. That’s a lot more capability than a high-end Intel CPU, but not 14,592 times more.
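(The marketing number is just the lane count: 114 SMs × 4 SIMD units × 32 lanes = 14,592 “CUDA cores.”)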

The CUDA API presents each “CUDA Core” as if it were an independent thread. But for most purposes it is actually a single SIMD lane within a 32-wide “warp”. Lots of caveats apply in the details, though.
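You can see the lane-in-a-warp reality directly with warp shuffles, which move data between “threads” register-to-register, no memory involved; that only makes sense because the 32 “threads” are lanes of one SIMD unit. A minimal sketch (kernel name is mine; build with nvcc):

    #include <cstdio>

    __global__ void warp_demo() {
        int lane = threadIdx.x % 32; // this "thread" is really lane N of a 32-wide warp
        // Read the value held by the lane to our left, register-to-register.
        int left = __shfl_sync(0xffffffffu, lane, (lane + 31) % 32);
        printf("lane %2d sees left neighbor's value %2d\n", lane, left);
    }

    int main() {
        warp_demo<<<1, 32>>>(); // launch exactly one warp
        cudaDeviceSynchronize();
        return 0;
    }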

bee_rider 5 days ago

I guess “GPUs for people who are already CPU experts” is a blog post that already exists out there. But if it doesn’t, you should go write it, haha.

shaklee3 4 days ago

This is not true. GPUs are SIMT, but any given thread among the 32 in a warp can also issue SIMD instructions; see vectorized loads, for example.
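I read the vector-load point as something like the sketch below: each SIMT thread issues a single 128-bit load/store via float4, i.e. SIMD within one lane (names are mine; build with nvcc):

    #include <cstdio>

    __global__ void scale2(const float4* in, float4* out, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = in[i]; // one 128-bit load (LDG.128): 4 floats per thread
            v.x *= 2.f; v.y *= 2.f; v.z *= 2.f; v.w *= 2.f;
            out[i] = v;       // one 128-bit store (STG.128)
        }
    }

    int main() {
        const int n4 = 256; // 256 float4s = 1024 floats
        float4 *in, *out;
        cudaMallocManaged(&in, n4 * sizeof(float4));
        cudaMallocManaged(&out, n4 * sizeof(float4));
        for (int i = 0; i < n4; ++i) in[i] = make_float4(i, i, i, i);
        scale2<<<(n4 + 127) / 128, 128>>>(in, out, n4);
        cudaDeviceSynchronize();
        printf("%f\n", out[3].x); // prints 6.000000
        cudaFree(in);
        cudaFree(out);
        return 0;
    }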