einpoklum 5 days ago

> It's not clear from the above what a "CUDA core" (singular) _is_

A CUDA core is basically a SIMD lane on an actual core of an NVIDIA GPU.

For a longer version of this answer: https://stackoverflow.com/a/48130362/1593077

pklausler 5 days ago | parent [-]

So it's a "SIMD lane" that can itself perform actual SIMD instructions?

I think you want a metaphor that doesn't also depend on its literal meaning.

corysama 5 days ago | parent | next [-]

Nvidia’s marketing team uses confusing terminology to make their product sound cooler than it is.

An Intel “core” can perform AVX-512 SIMD instructions that operate on 16 lanes of 32-bit data. Intel cores are packaged in groups of up to 16, and they use hyperthreading, speculative execution, and shadow registers to cover latency.

An Nvidia “Streaming Multiprocessor” can perform SIMD instructions on 32 lanes of 32 bits each. Nvidia calls these lanes “cores” to make it feel like one GPU can compete with thousands of Intel CPUs.

Simpler terminology would be: an Nvidia H100 has 114 SM cores, each with four 32-wide SIMD execution units (where basic instructions have a latency of 4 cycles) and four Tensor cores. That works out to 114 × 4 × 32 = 14,592 lanes, which is where the headline “CUDA core” count comes from. It’s a lot more capability than a high-end Intel CPU, but not 14,592 times more.

The CUDA API presents a “CUDA Core” (a single SIMD lane) as if it were a thread, but for most purposes it is actually a single SIMD lane within the 32-wide “warp”. Lots of caveats apply in the details, though.
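To make that concrete, here is a minimal CUDA sketch (the kernel and variable names are mine, just for illustration): each “thread” below is really one lane of a 32-wide warp, and the warp shuffle moves data between lanes’ registers without touching memory.

    // Each CUDA "thread" is one lane of a 32-wide warp executing in lockstep.
    __global__ void lane_demo(int *out) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;  // position of this "CUDA core" within its warp
        // Read tid from lane 0 of the warp, register to register, no memory access.
        int from_lane0 = __shfl_sync(0xFFFFFFFF, tid, 0);
        out[tid] = from_lane0 + lane;
    }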

bee_rider 5 days ago | parent | next [-]

I guess “GPUs for people who are already CPU experts” is a blog post that already exists out there. But if it doesn’t, you should go write it, haha.

shaklee3 4 days ago | parent | prev [-]

This is not true. GPUs are SIMT, but any given thread among the 32 in a warp can also issue SIMD instructions; see vector loads.

saltcured 5 days ago | parent | prev | next [-]

It's all very circular, if you try to avoid the architecture-specific details of individual hardware designs. A SIMD "lane" is roughly equivalent to an ALU (arithmetic logic unit) in a conventional CPU design. Conceptually, it processes one primitive operation such as add, multiply, or FMA (fused multiply-add) at a time on scalar values.

Each such scalar operation is on a fixed-width primitive number, which is where we get into the questions of what numeric types the hardware supports. E.g., we used to worry about 32- vs 64-bit support in GPUs, and now the concern is about smaller widths. Some image processing tasks benefit from 8- or 16-bit values. Lately, people are dipping into heavily quantized models that can benefit from even narrower values. The narrower values mean a smaller memory footprint, but also generally mean that you can do more parallel operations with "similar" amounts of logic, since each ALU processes fewer bits.

Where this lane==ALU analogy stumbles is when you get into all the details about how these ALUs are ganged together or in fact repartitioned on the fly. E.g. a SIMD group of lanes share some control signals and are not truly independent computation streams. Different memory architectures and superscalar designs also blur the ability to count computational throughput, as the number of operations that can retire per cycle becomes very task-dependent due to memory or port contention inside these beasts.

And if a system can reconfigure the lane width, it may effectively change a wide ALU into N logically smaller ALUs that reuse most of the same gates. Or, it might redirect some tasks to a completely different set of narrower hardware lanes that are otherwise idle. The dynamic ALU splitting was the conventional story around desktop SIMD, but I think is less true in modern designs. AFAICT, modern designs seem more likely to have some dedicated chip regions that go idle when they are not processing specific widths.

einpoklum 5 days ago | parent | prev | next [-]

> that can itself perform actual SIMD instructions?

Mostly, no; it can't really perform actual SIMD instructions itself. If you look at the SASS (the assembly language used on NVIDIA GPUs), I don't believe you'll see anything like that.

In high-level code, you do have expressions involving "vectorized types", which look like they would translate into SIMD instructions, but they 'serialize' at the single-thread level.

There are exceptions to this, though, like FP16 operations, which might work on 32-bit registers holding 2xFP16 values, and other cases. But that is not the rule.
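A rough sketch of both cases (assuming a GPU with FP16 support; the kernel name is mine): the float4 arithmetic below typically compiles to one wide 128-bit load per thread but four separate scalar adds, whereas the __hadd2 intrinsic genuinely adds two FP16 values packed into one 32-bit register.

    #include <cuda_fp16.h>

    __global__ void vec_demo(const float4 *in, float4 *out,
                             const __half2 *hin, __half2 *hout) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        float4 v = in[i];   // one 128-bit load per thread
        v.x += 1.f; v.y += 1.f; v.z += 1.f; v.w += 1.f;   // four scalar adds
        out[i] = v;

        // The FP16 exception: one instruction adds two packed half values.
        hout[i] = __hadd2(hin[i], __float2half2_rn(1.f));
    }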

pklausler 5 days ago | parent | next [-]

Please see https://docs.nvidia.com/cuda/parallel-thread-execution/index....

einpoklum 4 days ago | parent [-]

The "video instructions" are indeed another exception: Operations on sub-lanes of 32-bit values: 2x16 or 4x8. This is relevant for graphics/video work, where you often have Red, Green, Blue, Alpha channels of 8 bits each. Their use is uncommon (AFAICT) in CUDA compute work.

shaklee3 4 days ago | parent | prev [-]

Not true; there are a lot of SIMD instructions on GPUs.

einpoklum 4 days ago | parent [-]

Such as?

shaklee3 4 days ago | parent [-]

dp4a, ldg. Just Google it; there's a whole page of them.
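For instance, a minimal sketch of dp4a (requires compute capability 6.1+; the kernel name is mine): __dp4a computes a dot product of four packed 8-bit values and adds it to a 32-bit accumulator in a single instruction.

    __global__ void dp4a_demo(const int *a, const int *b, int *acc, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            acc[i] = __dp4a(a[i], b[i], acc[i]);  // four 8-bit multiply-accumulates at once
    }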

pavlov 5 days ago | parent | prev [-]

Nvidia calls their SIMD lanes “CUDA cores” for marketing reasons.