aleph_minus_one 2 hours ago
> And on x86, saturating addition can't be done in a tick

Perhaps I misunderstand your point, but I am rather sure that SSE.../AVX... do have instructions for saturating addition:

* (V)PADDSB, (V)PADDSW, (V)PADDUSB, (V)PADDUSW
* (V)PHADDSW, (V)PHSUBSW
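For reference, a scalar C sketch of what a single lane of two of these instructions computes. The function names are made up for illustration; the real instructions apply the same operation to every lane of a vector register at once.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of (V)PADDUSB: unsigned 8-bit add, clamped to 255. */
uint8_t paddusb_lane(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* One lane of (V)PADDSW: signed 16-bit add, clamped to [-32768, 32767]. */
int16_t paddsw_lane(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Small usage check: results clamp instead of wrapping around. */
void check_saturating_lanes(void) {
    assert(paddusb_lane(200, 100) == UINT8_MAX);
    assert(paddusb_lane(20, 30) == 50);
    assert(paddsw_lane(30000, 10000) == INT16_MAX);
    assert(paddsw_lane(-30000, -10000) == INT16_MIN);
}
```

The widening to a larger type before the add is only there to make the clamp easy to express in portable C.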
dzaima an hour ago
Unfortunately, those are vector-only, and limited to ≤16-bit ints at that; there is no 32-bit version. And, as the other reply says, multiply-high is nearly non-existent, which generally makes vectorized div-by-const its own mini-hell (though doing a 2x-width multiply with fixups is still better than the OP's 4x-width method).

(...though x86 does have (v)pmulhw for 16-bit input, so for 16-bit div-by-const the saturating option works out quite well.)

(And, for what it's worth, the lack of 8-bit multiplies on x86 means that the OP's method of taking the high half of a 4x-width multiply works out nicely for vectorizing division of 8-bit ints too.)
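A minimal scalar sketch of both cases, assuming "the saturating option" means a saturating increment of the dividend followed by a multiply-high (that is my reading; the thread does not spell it out). The divisor 7, the constants 37449 = floor(2^18 / 7) and 9363 = ceil(2^16 / 7), and all function names are illustrative choices. `mulhi_u16` models one lane of the unsigned variant, (v)pmulhuw.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of (V)PMULHUW: high 16 bits of an unsigned 16x16 product. */
uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

/* One lane of a saturating increment, i.e. (V)PADDUSW with 1. */
uint16_t sat_inc_u16(uint16_t n) {
    return (uint16_t)(n == UINT16_MAX ? n : n + 1);
}

/* 16-bit n / 7: saturating increment, multiply-high, then shift by 2.
   Total shift is 16 + 2 = 18 bits, matching the 2^18 in the constant. */
uint16_t div7_u16(uint16_t n) {
    return (uint16_t)(mulhi_u16(sat_inc_u16(n), 37449u) >> 2);
}

/* 8-bit n / 7: widen to a 16-bit lane and keep the high half of the
   32-bit product. No increment or extra shift is needed at this width. */
uint8_t div7_u8(uint8_t n) {
    return (uint8_t)mulhi_u16(n, 9363u);
}

/* Exhaustive check against plain integer division. */
void check_div7(void) {
    for (uint32_t n = 0; n <= UINT16_MAX; n++)
        assert(div7_u16((uint16_t)n) == (uint16_t)(n / 7));
    for (uint32_t n = 0; n <= UINT8_MAX; n++)
        assert(div7_u8((uint8_t)n) == (uint8_t)(n / 7));
}
```

The saturation only matters for n = 65535, where the increment would otherwise wrap to 0. Other divisors need their own constant and shift, and this particular sketch is only checked for 7.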
oxxoxoxooo an hour ago
On x86, there is no vector instruction to get the upper half of a 64-bit × 64-bit integer product. ARM SVE2 and RISC-V RVV have one; x86 unfortunately does not (and probably won't for a long time, as AVX10 does not add it either).
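What x86 does have is PMULUDQ, a 32×32→64-bit multiply per 64-bit lane, so an emulation has to assemble the high half from four partial products. A scalar sketch of that assembly, with function names of my own choosing; a vector version would presumably perform the same steps per lane:

```c
#include <assert.h>
#include <stdint.h>

/* Upper 64 bits of an unsigned 64x64 product, built from four
   32x32->64 partial products (the only width assumed available). */
uint64_t mulhi_u64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Sum everything that lands in bits 32..63 of the full product;
       this sum cannot overflow 64 bits, and its top half is the carry
       into the upper 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Small usage check with values whose high halves are easy to verify. */
void check_mulhi_u64(void) {
    assert(mulhi_u64(UINT64_MAX, UINT64_MAX) == UINT64_MAX - 1);
    assert(mulhi_u64((uint64_t)1 << 32, (uint64_t)1 << 32) == 1);
    assert(mulhi_u64((uint64_t)1 << 63, 2) == 1);
}
```

That is four multiplies plus shifts, masks and adds for what is a single instruction on the other architectures.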