▲ | derf_ a day ago
> I code with SIMD as the target, and have special containers that pad memory to SIMD width...

I think this may be domain-specific. I help maintain several open-source audio libraries, and wind up being the one to review the patches when people contribute SIMD for some specific ISA, and I think without exception they always get the tail handling wrong. Due to other interactions it cannot always be avoided by padding. It can roughly double the complexity of the code [0], and requires a disproportionate amount of thinking time vs. the time the code spends running, but if you don't spend that thinking time you can get OOB reads or writes, and thus CVEs.

Masked loads/stores are an improvement, but not universally available. I don't have a lot of concrete suggestions.

I also work with a lot of image/video SIMD, and this is just not a problem, because most operations happen on fixed block sizes, and padding buffers is easy and routine.

I agree I would have picked other things for the other two in my own top-3 list.

[0] Here is a fun one, which actually performs worst when len is a multiple of 8 (which it almost always is), and has 59 lines of code for tail handling vs. 33 lines for the main loop: https://gitlab.xiph.org/xiph/opus/-/blob/main/celt/arm/celt_...
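For readers unfamiliar with the pattern, here is a deliberately simple sketch of why tails bloat the code (this is not the Opus routine; it assumes x86-64 with SSE2): the main loop consumes 4 floats per iteration, and a separate scalar loop must finish the remainder without ever reading out of bounds.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical dot product: SSE2 main loop plus a scalar tail.
 * The tail loop is where OOB bugs tend to creep in. */
float dot_sse2(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {                  /* main loop: 4 lanes */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                    /* horizontal reduce */
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)                            /* scalar tail: 0..3 elems */
        sum += a[i] * b[i];
    return sum;
}
```

Even this toy version needs two loops plus a reduction; real kernels with alignment requirements, strides, and fixed-point rounding multiply that bookkeeping.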
▲ | jandrewrogers 20 hours ago
> Masked loads/stores are an improvement, but not universally available.

Traditionally we’ve worked around this with pretty idiomatic hacks that efficiently implement “masked load” functionality in SIMD ISAs that don’t have them. We could probably be better about not making people write this themselves every time.
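One such idiom (an illustrative sketch, not from any particular library; assumes x86-64 with SSE2) goes through a small zeroed buffer: memcpy the n valid elements into it, then issue a full-width load, so nothing past the end of the source is ever touched.

```c
#include <immintrin.h>
#include <string.h>
#include <stddef.h>

/* Emulated masked load: returns {p[0..n-1], zero-padded}, for 0 <= n <= 4.
 * The memcpy never reads past p + n, so there is no OOB access. */
static __m128 loadu_partial_ps(const float *p, size_t n)
{
    float tmp[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    memcpy(tmp, p, n * sizeof(float));
    return _mm_loadu_ps(tmp);
}
```

The other common trick is an overlapping full-width load that ends exactly at the buffer boundary, followed by a shuffle; it avoids the store/load round trip but only works when the whole buffer is at least one vector long.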
▲ | codedokode 12 hours ago
I think that SIMD code should not be written by hand but rather in a high-level language, so that dealing with the tail becomes the compiler's problem and not the programmer's. Or do people still prefer to write assembly by hand? It seems so, judging by the code you linked.

What I want is to write the code in a more high-level style: for example, express the scalar product of a and b as a single expression, and have it automatically compiled into SIMD instructions for every existing architecture (and, for large arrays, into a multi-threaded computation).
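Part of that wish already works today: modern compilers auto-vectorize the naive loop, tail handling included. A hedged sketch; note that gcc and clang only vectorize a float reduction like this at -O3 with -ffast-math (or -fassociative-math), because it reorders the additions.

```c
#include <stddef.h>

/* Naive scalar product; with `cc -O3 -ffast-math`, gcc/clang emit a SIMD
 * main loop and generate the tail handling themselves. */
float dot_naive(const float *restrict a, const float *restrict b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The restrict qualifiers tell the compiler the arrays don't alias, which it would otherwise have to guard against with a runtime check before vectorizing.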
▲ | ack_complete 16 hours ago
It depends on how integrated your SIMD strategy is into the overall technical design. Tail handling is much easier if you can afford SIMD-friendly padding so a full vector load/store is possible even if you have to manually mask. That avoids a lot of the hassle of breaking down memory accesses just to avoid a page fault or setting off the memory checker.

Beyond that -- unit testing. I don't see enough of it for vectorized routines. SIMD widths are small enough that you can usually just test all possible offsets right up against a guard page and brute-force verify that no overruns occur.
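A sketch of that guard-page technique (POSIX-specific; the summing routine here is a trivial stand-in for a real vectorized kernel): map two pages, make the second one PROT_NONE, and run the routine with its input ending exactly at the guard page for every length under test. Any overrun read faults immediately instead of passing silently.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

/* Stand-in for the vectorized routine under test. */
static float sum_f32(const float *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Returns 0 on success; a buggy routine that reads past the end of its
 * input hits the guard page and crashes instead of passing the test. */
static int check_no_overrun(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    if (mprotect(buf + page, page, PROT_NONE) != 0)  /* guard page */
        return -1;
    float *end = (float *)(buf + page);              /* guard starts here */
    for (size_t n = 1; n <= 32; n++) {
        float *a = end - n;                          /* input ends at guard */
        for (size_t i = 0; i < n; i++)
            a[i] = 1.0f;
        if (sum_f32(a, n) != (float)n)
            return -1;
    }
    munmap(buf, 2 * page);
    return 0;
}
```

A fuller harness would also sweep the start offset to cover every alignment, as the comment suggests, and repeat with the input placed just after a leading guard page to catch underruns.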