pjmlp 4 hours ago

Yet it is still faster than not vectorizing at all, or than calling into the GPU, on workloads where bus traffic takes the majority of the execution time.

fooker 3 hours ago

The comparison is often just plain old linear code.

For example, one SIMD instruction vs. multiple arithmetic instructions:

  x1 += y1
  x2 += y2
  x3 += y3
  x4 += y4
  
We have fifty years of CPU design optimizing for this. More often than not, you'll find this works better than vector instructions in practice.

The concept behind vector instructions is great, and it starts to work out for larger widths like 512 bits. But it's extremely tricky to take advantage of that much SIMD with a compiler or manually.

pjmlp 3 hours ago

Yet there are gains from doing e.g. string searches with SIMD, which you naturally aren't going to do in CUDA.

fooker 3 hours ago

For sure, it makes sense for nice well defined problems that execute in isolation.

Think of the situation where the string search is running on a system with hyper-threading, a bunch of cores, and a normal amount of memory bandwidth.

The search itself will be faster, but overusing vector instructions makes everything else running on the machine worse at the same time.

(also, cherry on top: some modern CPUs automatically lower their clock frequency when they encounter wide vector instructions!)