Or at runtime, if you'd like. You can create a generic binary that runs faster on supported platforms.

> Or at runtime, if you'd like

You have to be careful about how you do it because those runtime checks can easily swamp the performance gains you get from SIMD.

> also get the block size according to CPU features at compile time with `std.simd.suggestVectorSize()`

Be careful with this too: I believe `std.simd.suggestVectorSize` returns a value for the minimum SIMD version you're targeting, which can be suboptimal for portable binaries.

You probably want a mix: carefully compute the vector size for the current platform once, globally, and compile multiple dispatch paths into your binary that you pick between based on that value, letting the CPU's branch predictor hide the cost of a check before each invocation.
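A minimal Rust sketch of that mix, for illustration only. The `SimdLevel` enum, `simd_level`, and the `sum` kernels are hypothetical names, not from any library. It assumes AVX2 is the only accelerated tier worth shipping. Detection runs once and is cached; each call then pays only for a cached load and a predictable `match`.

```rust
use std::sync::OnceLock;

/// Hypothetical SIMD tiers for this sketch; a real build would list the paths it ships.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SimdLevel {
    Scalar,
    Avx2,
}

/// Runs feature detection on the first call only and caches the answer.
pub fn simd_level() -> SimdLevel {
    static LEVEL: OnceLock<SimdLevel> = OnceLock::new();
    *LEVEL.get_or_init(detect)
}

fn detect() -> SimdLevel {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return SimdLevel::Avx2;
        }
    }
    SimdLevel::Scalar
}

/// Shared loop body, inlined into each tier so it is compiled once per feature set.
#[inline(always)]
fn sum_body(data: &[u32]) -> u32 {
    data.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

fn sum_scalar(data: &[u32]) -> u32 {
    sum_body(data)
}

/// Same loop, but the compiler may use AVX2 instructions inside this function.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[u32]) -> u32 {
    sum_body(data)
}

/// Per-call dispatch: a cached value plus a branch that should predict well.
pub fn sum(data: &[u32]) -> u32 {
    match simd_level() {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        // Safety: this arm is only reachable after AVX2 was detected at runtime.
        SimdLevel::Avx2 => unsafe { sum_avx2(data) },
        _ => sum_scalar(data),
    }
}
```

Calling `sum(&values)` from a hot loop never re-runs detection. Whether the remaining branch is actually free depends on the workload, so it is worth measuring.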


stouset 5 days ago | parent [-]

> You have to be careful about how you do it because those runtime checks can easily swamp the performance gains you get from SIMD.

That seems surprising, particularly given that autovectorizing compilers tend to insert pretty extensive preambles that check for whether or not it's likely the vectorized one will have a speedup over the looping version (e.g., based on the number of iterations) and postambles that handle the cases where the number of loop iterations isn't cleanly divisible by the number of elements in the chosen vector size.
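For readers who haven't looked at vectorizer output, here is a hand-written Rust sketch of the shape being described: a trip-count check up front, a main loop over full-width blocks, and a scalar tail for the leftovers. `LANES` and `add_one` are made up for illustration; a real compiler picks its own width and thresholds.

```rust
/// Illustrative lane count (8 x u32 = 256 bits); an assumption, not a compiler constant.
const LANES: usize = 8;

pub fn add_one(data: &mut [u32]) {
    // Preamble: with too few iterations, skip the wide path entirely.
    if data.len() < LANES {
        for x in data.iter_mut() {
            *x = x.wrapping_add(1);
        }
        return;
    }

    // Main body: whole blocks of LANES elements, a shape that maps onto vector ops.
    let mut blocks = data.chunks_exact_mut(LANES);
    for block in &mut blocks {
        for x in block.iter_mut() {
            *x = x.wrapping_add(1);
        }
    }

    // Postamble: the len % LANES elements that did not fill a block.
    for x in blocks.into_remainder() {
        *x = x.wrapping_add(1);
    }
}
```

Both checks depend only on the slice length, so they are a compare and a branch each.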

Why would checking for supported SIMD instructions cause that much additional work?

Also, even if this is the case, you can always check once and then replace the function body with the chosen one, eliding the check.
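One way to approximate "check once, then swap in the chosen body" in Rust is a lazily resolved function pointer. This is a sketch under assumptions: `count_nonzero` and its helpers are hypothetical, and AVX2 is the only alternative path. The first call still goes through the `OnceLock`, and later calls still pay a cached load plus an indirect call, so the check is cheap rather than literally gone. Linker-level schemes such as GNU ifunc are outside this sketch.

```rust
use std::sync::OnceLock;

type CountFn = fn(&[u8]) -> usize;

/// Shared loop body, inlined into each variant.
#[inline(always)]
fn count_body(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b != 0).count()
}

fn count_scalar(data: &[u8]) -> usize {
    count_body(data)
}

/// Same loop, compiled with AVX2 enabled for this function only.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn count_avx2(data: &[u8]) -> usize {
    count_body(data)
}

/// Safe wrapper so the AVX2 variant fits the plain `fn` pointer type.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn count_avx2_entry(data: &[u8]) -> usize {
    // Safety: `resolve` only hands out this pointer after detecting AVX2.
    unsafe { count_avx2(data) }
}

/// Runs feature detection and returns the implementation to use from now on.
fn resolve() -> CountFn {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return count_avx2_entry;
        }
    }
    count_scalar
}

/// Public entry point: resolves on first use, then calls through the cached pointer.
pub fn count_nonzero(data: &[u8]) -> usize {
    static IMPL: OnceLock<CountFn> = OnceLock::new();
    let f = *IMPL.get_or_init(resolve);
    f(data)
}
```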

vlovich123 5 days ago | parent [-]

> Why would checking for supported SIMD instructions cause that much additional work?

Because CPUID checks on x86 are expensive for whatever reason.

> That seems surprising, particularly given that autovectorizing compilers tend to insert pretty extensive preambles that check for whether or not it's likely the vectorized one will have a speedup over the looping version (e.g., based on the number of iterations) and postambles that handle the cases where the number of loop iterations isn't cleanly divisible by the number of elements in the chosen vector size.

Compilers can't elide those checks unless they are given specific flags that tell them the target CPU supports that specific instruction set OR they always just choose to target the minimum supported SIMD instruction set on the target CPU. They often emit suboptimal code for all sorts of reasons, this being one of them.

> Also, even if this is the case, you can always check once and then replace the function body with the chosen one, eliding the check.

Yes, but like I said, you have to do it very carefully to make sure you're calling CPUID once outside of a hot loop to initialize your decision making and then relying on the CPU's predictor to elide the cost of a boolean / switch statement in your code doing the actual dispatch.