pjmlp 4 hours ago
Yet it is still faster than not using it at all, or than calling into the GPU, on workloads where the bus traffic would take up the majority of execution time.
fooker 3 hours ago
The comparison is often just plain old linear code: for example, one SIMD instruction versus multiple arithmetic instructions.
We have fifty years of CPU design optimizing for this. More often than not, you'll find the linear code works better than vector instructions in practice.
The concept behind vector instructions is great, and it starts to pay off at larger widths like 512 bits. But it's extremely tricky to take advantage of that much SIMD, whether through a compiler or by hand.
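To make the "one SIMD instruction versus multiple arithmetic instructions" comparison concrete, here is a minimal sketch in C. It assumes GCC or Clang, since it relies on their `vector_size` extension rather than standard C, and the names `v4f`, `add4_scalar`, `add4_vector` and `add4_results_match` are made up for illustration. Whether the vector version actually becomes a single instruction, and whether it is any faster, depends on the target and compiler flags; this only shows the two shapes of code being compared.

```c
#include <string.h>

/* GCC/Clang extension (assumption: not standard C): four floats in 128 bits. */
typedef float v4f __attribute__((vector_size(16)));

/* Plain linear code: four independent scalar additions. */
void add4_scalar(const float *a, const float *b, float *out) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}

/* Vector form: a single addition expressed over all four lanes. */
void add4_vector(const float *a, const float *b, float *out) {
    v4f va, vb, vc;
    memcpy(&va, a, sizeof va);   /* load without assuming alignment */
    memcpy(&vb, b, sizeof vb);
    vc = va + vb;                /* element-wise add on the vector type */
    memcpy(out, &vc, sizeof vc);
}

/* Returns 1 if both versions agree on a small sample input, else 0. */
int add4_results_match(void) {
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float s[4], v[4];
    add4_scalar(a, b, s);
    add4_vector(a, b, v);
    for (int i = 0; i < 4; i++) {
        if (s[i] != v[i]) return 0;
    }
    return 1;
}
```

On a modern out-of-order CPU the four scalar adds are independent and can be issued in parallel, which is one reason the linear version is often closer to the vector one than the instruction count suggests.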