lacedeconstruct 2 hours ago

The difference between 20 clock cycles and 1 clock cycle in a hot loop is very noticeable.

Sesse__ an hour ago | parent | next [-]

Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., on most modern x86, vmulps has a latency of 3 cycles and a reciprocal throughput of 0.5 cycles. No 20 cycles in sight.)

exyi an hour ago | parent | prev | next [-]

It's 3 cycles for float multiplication (and 1 for shift right):

https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...

https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...

In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

Tuna-Fish an hour ago | parent | prev [-]

FP division by a constant is optimized by the compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.

pixelesque 37 minutes ago | parent | next [-]

Only with things like -ffast-math enabled will compilers do the reciprocal. It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code.

mgaunard an hour ago | parent | prev [-]

That's only valid to do if the reciprocal is representable exactly.
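
A rough sketch of the exactness point, assuming Python floats are IEEE 754 doubles (true on mainstream platforms). The helper `count_mismatches` is made up for illustration. It counts the inputs where dividing by a constant and multiplying by its precomputed reciprocal give different results:

```python
def count_mismatches(divisor, samples):
    """Count inputs where x / divisor differs from x * (1 / divisor)."""
    # The reciprocal is rounded once here, unless the divisor is a power of two.
    reciprocal = 1.0 / divisor
    return sum(1 for x in samples if x / divisor != x * reciprocal)


# Assumption: small integer-valued doubles are enough to show the effect.
samples = [float(i) for i in range(1, 1001)]

exact_case = count_mismatches(4.0, samples)    # 1/4 is exactly representable
inexact_case = count_mismatches(3.0, samples)  # 1/3 is not

print("divisor 4.0 mismatches:", exact_case)
print("divisor 3.0 mismatches:", inexact_case)
```

With a divisor of 4.0 the two forms should always agree, because 0.25 is exact and both operations only rescale the exponent. With 3.0 some inputs differ in the last bit (5.0 is one of them), which is why a compiler keeps the division unless it is told that such differences are acceptable.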