lacedeconstruct 2 hours ago
The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable.
Sesse__ an hour ago
Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 cores are 3/0.5 cycles, latency/reciprocal throughput, for vmulps. No 20 cycles in sight.)
exyi an hour ago
It's 3 cycles for float multiplication (and 1 for shift right): https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on... https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c... In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.
Tuna-Fish an hour ago
FP division by a constant is optimized by the compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.
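To illustrate the division-to-multiply point: the rewrite is bit-exact only when the constant's reciprocal is exactly representable (a power of two, for instance). For other constants, compilers generally need relaxed floating-point flags before they will do it. Below is a minimal sketch using Python's double-precision floats. The helper name `mismatches` and the sample range are made up for the illustration and are not from the thread.

```python
def mismatches(divisor, values):
    """Return the inputs x for which x / divisor != x * (1 / divisor)."""
    reciprocal = 1.0 / divisor  # what a compiler would precompute
    return [x for x in values if x / divisor != x * reciprocal]

# Hypothetical sample inputs: the floats 1.0 through 100.0.
xs = [float(i) for i in range(1, 101)]

# 1/4 is exactly representable, so multiply and divide always agree.
exact_case = mismatches(4.0, xs)

# 1/3 is rounded, so some inputs differ in the last bit (5.0 is one of them).
inexact_case = mismatches(3.0, xs)

print(len(exact_case), len(inexact_case))
```

Dividing by 4.0 and multiplying by 0.25 both round the same exact value, so the results are identical. With 3.0, the rounding error in the precomputed reciprocal can push the product to a neighbouring float.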