einpoklum 5 days ago

> that can itself perform actual SIMD instructions?

Mostly, no; it can't really perform actual SIMD instructions itself. If you look at the SASS (the native assembly language of NVIDIA GPUs), I don't believe you'll see anything like that.

In high-level code, you do have expressions involving "vectorized types", which look like they would translate into SIMD instructions, but they are 'serialized' at the single-thread level.

There are exceptions to this though, like FP16 operations, which may operate on two FP16 values packed into one 32-bit register, and some other cases. But that is not the rule.
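To make the contrast between "vectorized types" and the packed-FP16 exception concrete, here is a minimal sketch. The names `add_float4`, `demo_kernel` and `run_demo` are made up for illustration, and error checking is omitted. CUDA has no built-in `operator+` for `float4`, so the addition is spelled out per component, i.e. as four scalar adds in each thread. `__hadd2` from `cuda_fp16.h` adds both halves of a `__half2` in one call. I'd expect that to map to a single packed instruction on hardware with native FP16 arithmetic (compute capability 5.3 and up), but check the SASS to be sure.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// float4 is a plain struct of four floats; adding two of them is
// four independent scalar additions within a single thread.
__device__ float4 add_float4(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

__global__ void demo_kernel(float* out) {
    float4 f = add_float4(make_float4(1.f, 2.f, 3.f, 4.f),
                          make_float4(10.f, 20.f, 30.f, 40.f));

    // __half2 packs two FP16 values into one 32-bit register;
    // __hadd2 adds the low and high halves pairwise.
    __half2 h = __hadd2(__floats2half2_rn(1.5f, 2.5f),
                        __floats2half2_rn(0.5f, 0.5f));

    out[0] = f.x; out[1] = f.y; out[2] = f.z; out[3] = f.w;
    out[4] = __low2float(h);   // 1.5 + 0.5
    out[5] = __high2float(h);  // 2.5 + 0.5
}

// Hypothetical host helper: runs the kernel in one thread and copies
// the six results back. No error checking, for brevity.
void run_demo(float host_out[6]) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, 6 * sizeof(float));
    demo_kernel<<<1, 1>>>(d_out);
    cudaMemcpy(host_out, d_out, 6 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```

Compile for an architecture with native FP16 arithmetic, e.g. `nvcc -arch=sm_70`.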

pklausler 5 days ago

Please see https://docs.nvidia.com/cuda/parallel-thread-execution/index....

einpoklum 4 days ago

The "video instructions" are indeed another exception: they operate on sub-lanes of 32-bit values, either 2x16 bits or 4x8 bits. This is relevant for graphics/video work, where you often have Red, Green, Blue and Alpha channels of 8 bits each. Their use is uncommon (AFAICT) in CUDA compute work.
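For a feel of what a 4x8 sub-lane operation does, here is a small sketch using the `__vadd4` intrinsic, which adds four packed unsigned bytes with per-byte wraparound. The kernel and helper names (`vadd4_kernel`, `run_vadd4`) are made up and error checking is omitted. A carry out of one byte does not spill into its neighbour, unlike an ordinary 32-bit add. On many architectures this intrinsic may be lowered to several ordinary instructions rather than one hardware instruction, so treat it as an illustration of the semantics only.

```cuda
#include <cuda_runtime.h>

// Each 32-bit word is treated as four independent 8-bit lanes,
// e.g. the R, G, B, A channels of one pixel.
__global__ void vadd4_kernel(unsigned int a, unsigned int b,
                             unsigned int* out) {
    *out = __vadd4(a, b);  // per-byte add, wrapping modulo 256
}

// Hypothetical host helper: runs the kernel in one thread and
// returns the packed result. No error checking, for brevity.
unsigned int run_vadd4(unsigned int a, unsigned int b) {
    unsigned int* d_out = nullptr;
    unsigned int h_out = 0;
    cudaMalloc(&d_out, sizeof(unsigned int));
    vadd4_kernel<<<1, 1>>>(a, b, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

For example, `run_vadd4(0x00FF00FFu, 0x00010001u)` yields `0x00000000`, whereas a plain 32-bit add of the same operands gives `0x01000100`.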

shaklee3 4 days ago

Not true; there are a lot of SIMD instructions on GPUs.

einpoklum 4 days ago

Such as?

shaklee3 4 days ago

dp4a, ldg. Just Google it; there's a whole page of them.
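For reference, a minimal sketch of what dp4a computes, via the `__dp4a` intrinsic: it treats each 32-bit operand as four packed signed 8-bit values, takes their dot product, and adds a 32-bit accumulator. The names `dot4`, `dot4_kernel` and `run_dot4` are made up, error checking is omitted, and the scalar fallback is only there so the file still compiles for targets below compute capability 6.1, where the intrinsic is not available.

```cuda
#include <cuda_runtime.h>

__device__ int dot4(int a, int b, int c) {
#if __CUDA_ARCH__ >= 610
    // Four-way int8 dot product with 32-bit accumulate.
    return __dp4a(a, b, c);
#else
    // Scalar fallback with the same semantics: one byte lane at a time.
    for (int i = 0; i < 4; ++i) {
        int ai = static_cast<signed char>((a >> (8 * i)) & 0xFF);
        int bi = static_cast<signed char>((b >> (8 * i)) & 0xFF);
        c += ai * bi;
    }
    return c;
#endif
}

__global__ void dot4_kernel(int a, int b, int c, int* out) {
    *out = dot4(a, b, c);
}

// Hypothetical host helper: runs the kernel in one thread and
// returns the result. No error checking, for brevity.
int run_dot4(int a, int b, int c) {
    int* d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    dot4_kernel<<<1, 1>>>(a, b, c, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

With `a = 0x04030201` (bytes 1, 2, 3, 4) and `b = 0x08070605` (bytes 5, 6, 7, 8), `run_dot4(a, b, 100)` gives 1*5 + 2*6 + 3*7 + 4*8 + 100 = 170. Compile with `nvcc -arch=sm_61` or later to get the intrinsic path.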