▲ | derf_ a day ago
> I code with SIMD as the target, and have special containers that pad memory to SIMD width...

I think this may be domain-specific. I help maintain several open-source audio libraries, and wind up being the one to review the patches when people contribute SIMD for some specific ISA, and I think without exception they always get the tail handling wrong. Due to other interactions it cannot always be avoided by padding. It can roughly double the complexity of the code [0], and requires a disproportionate amount of thinking time vs. the time the code spends running, but if you don't spend that thinking time you can get OOB reads or writes, and thus CVEs.

Masked loads/stores are an improvement, but not universally available. I don't have a lot of concrete suggestions.

I also work with a lot of image/video SIMD, and this is just not a problem, because most operations happen on fixed block sizes, and padding buffers is easy and routine.

I agree I would have picked other things for the other two in my own top-3 list.

[0] Here is a fun one, which actually performs worst when len is a multiple of 8 (which it almost always is), and has 59 lines of code for tail handling vs. 33 lines for the main loop: https://gitlab.xiph.org/xiph/opus/-/blob/main/celt/arm/celt_...
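For readers unfamiliar with the pattern, here is a deliberately simple sketch of why tails bloat the code (this is not the Opus routine; it assumes x86-64 with SSE2): the main loop consumes 4 floats per iteration, and a separate scalar loop must finish the remainder without ever reading out of bounds.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical dot product: SSE2 main loop plus a scalar tail.
 * The tail loop is where OOB bugs tend to creep in. */
float dot_sse2(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {                  /* main loop: 4 lanes */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                    /* horizontal reduce */
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)                            /* scalar tail: 0..3 elems */
        sum += a[i] * b[i];
    return sum;
}
```

Even this toy version needs two loops plus a reduction; real kernels with alignment requirements, strides, and fixed-point rounding multiply that bookkeeping.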
▲ | jandrewrogers 20 hours ago
> Masked loads/stores are an improvement, but not universally available.

Traditionally we’ve worked around this with pretty idiomatic hacks that efficiently implement “masked load” functionality in SIMD ISAs that don’t have them. We could probably be better about not making people write this themselves every time.
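One such idiom (an illustrative sketch, not from any particular library; assumes x86-64 with SSE2) goes through a small zeroed buffer: memcpy the n valid elements into it, then issue a full-width load, so nothing past the end of the source is ever touched.

```c
#include <immintrin.h>
#include <string.h>
#include <stddef.h>

/* Emulated masked load: returns {p[0..n-1], zero-padded}, for 0 <= n <= 4.
 * The memcpy never reads past p + n, so there is no OOB access. */
static __m128 loadu_partial_ps(const float *p, size_t n)
{
    float tmp[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    memcpy(tmp, p, n * sizeof(float));
    return _mm_loadu_ps(tmp);
}
```

The other common trick is an overlapping full-width load that ends exactly at the buffer boundary, followed by a shuffle; it avoids the store/load round trip but only works when the whole buffer is at least one vector long.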
▲ | codedokode 12 hours ago
I think that SIMD code should not be written by hand but rather in a high-level language, so that dealing with the tail becomes the compiler's problem and not the programmer's. Or do people still prefer to write assembly by hand? It seems so, judging by the code you linked.

What I want is to write the code in a more high-level style: for example, express the scalar product of a and b as a single expression, and have it automatically compiled into SIMD instructions for every existing architecture (and, for large arrays, into a multi-threaded computation).
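Part of that wish already works today: modern compilers auto-vectorize the naive loop, tail handling included. A hedged sketch; note that gcc and clang only vectorize a float reduction like this at -O3 with -ffast-math (or -fassociative-math), because it reorders the additions.

```c
#include <stddef.h>

/* Naive scalar product; with `cc -O3 -ffast-math`, gcc/clang emit a SIMD
 * main loop and generate the tail handling themselves. */
float dot_naive(const float *restrict a, const float *restrict b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The restrict qualifiers tell the compiler the arrays don't alias, which it would otherwise have to guard against with a runtime check before vectorizing.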
▲ | ack_complete 16 hours ago
It depends on how integrated your SIMD strategy is into the overall technical design. Tail handling is much easier if you can afford SIMD-friendly padding so a full vector load/store is possible even if you have to manually mask. That avoids a lot of the hassle of breaking down memory accesses just to avoid a page fault or setting off the memory checker.

Beyond that -- unit testing. I don't see enough of it for vectorized routines. SIMD widths are small enough that you can usually just test all possible offsets right up against a guard page and brute-force verify that no overruns occur.
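A sketch of that guard-page technique (POSIX-specific; the summing routine here is a trivial stand-in for a real vectorized kernel): map two pages, make the second one PROT_NONE, and run the routine with its input ending exactly at the guard page for every length under test. Any overrun read faults immediately instead of passing silently.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

/* Stand-in for the vectorized routine under test. */
static float sum_f32(const float *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Returns 0 on success; a buggy routine that reads past the end of its
 * input hits the guard page and crashes instead of passing the test. */
static int check_no_overrun(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    if (mprotect(buf + page, page, PROT_NONE) != 0)  /* guard page */
        return -1;
    float *end = (float *)(buf + page);              /* guard starts here */
    for (size_t n = 1; n <= 32; n++) {
        float *a = end - n;                          /* input ends at guard */
        for (size_t i = 0; i < n; i++)
            a[i] = 1.0f;
        if (sum_f32(a, n) != (float)n)
            return -1;
    }
    munmap(buf, 2 * page);
    return 0;
}
```

A fuller harness would also sweep the start offset to cover every alignment, as the comment suggests, and repeat with the input placed just after a leading guard page to catch underruns.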