pjmlp 7 hours ago

Additionally there is still too much performance left on the table by not properly using CPU vector units.

fooker 2 hours ago | parent [-]

SIMD performance in modern Intel and AMD cpus is so bad that it is useless outside very specific circumstances.

This is mainly because vector instructions are implemented by sharing resources with other parts of the CPU, which more or less stalls pipelines, significantly reduces IPC, and makes out-of-order execution ineffective.

The shared resources often involve the floating-point registers and compute units, so it's a double whammy.

pjmlp 2 hours ago | parent [-]

Yet it is still faster than doing nothing, or than calling into the GPU, on workloads where bus traffic takes up the majority of execution time.

fooker an hour ago | parent [-]

The comparison is often just plain old linear code.

For example, one simd instruction vs multiple arithmetic instructions.

  x1 += y1
  x2 += y2
  x3 += y3
  x4 += y4
  
We have fifty years of CPU design optimizing for this. More often than not, you'll find this works better than vector instructions in practice.

The concept behind vector instructions is great, and it starts to work out for larger widths like 512 bits. But it's extremely tricky to take advantage of that much SIMD with a compiler or manually.

pjmlp an hour ago | parent [-]

Yet there are gains from doing e.g. string searches with SIMD, which you naturally aren't going to do in CUDA.

fooker an hour ago | parent [-]

For sure, it makes sense for nice well defined problems that execute in isolation.

Think of the situation where the string search is running on a system that has hyper threading and a bunch of cores, and a normal amount of memory bandwidth.

It'll be faster, but at the same time make everything else worse if you overuse vector instructions.

(also cherry on top: some modern CPUs automagically lower the clock when they encounter vector instructions!!!)