|
| ▲ | xigoi 2 hours ago | parent | next [-] |
| You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow. |
| |
| ▲ | dheera an hour ago | parent [-] | | Same point: dividing by a power of 2 is a fast exponent subtraction in float world, while dividing by 255 disturbs the whole float |
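A minimal sketch of the exponent trick described in this subthread, assuming `float` is IEEE 754 binary32. The helper names `div256_by_exponent` and `shift_bits_right8` are hypothetical and exist only for illustration. The first subtracts 8 from the biased exponent and falls back to ordinary division whenever that would underflow or the input is a special value. The second does the naive right shift of the raw bits, which the comment above calls garbage.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: divides by 256 by subtracting 8 from the biased
   exponent of an IEEE 754 binary32 value. Falls back to ordinary division
   for zero, subnormals, infinity, NaN, and results that would leave the
   normal range (the underflow check mentioned in the thread). */
static float div256_by_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* well-defined type pun */

    uint32_t exponent = (bits >> 23) & 0xFFu;  /* biased exponent, 0..255 */
    if (exponent <= 8u || exponent == 0xFFu) {
        return x / 256.0f;                     /* underflow or special value */
    }

    bits -= 8u << 23;                          /* exponent -= 8, i.e. x * 2^-8 */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* The operation the thread warns against: shifting the raw bit pattern. */
static float shift_bits_right8(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits >>= 8;                                /* mixes sign, exponent, mantissa */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, `div256_by_exponent(512.0f)` yields `2.0f`, while `shift_bits_right8(512.0f)` yields a tiny subnormal value that has nothing to do with 2. In practice a compiler already emits a plain multiply for `x / 256.0f`, so this is an illustration of the encoding rather than a suggested optimization.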
|
|
| ▲ | StilesCrisis 2 hours ago | parent | prev | next [-] |
| It's just multiplication. Floating multiply is extraordinarily fast. |
| |
| ▲ | lacedeconstruct 2 hours ago | parent [-] | | The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable | | |
| ▲ | Sesse__ an hour ago | parent | next [-] | | Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 cores have 3 cycles of latency and 0.5 cycles of reciprocal throughput for vmulps. No 20 cycles in sight.) | |
| ▲ | exyi an hour ago | parent | prev | next [-] | | It's 3 cycles for float multiplication (and 1 for shift right): https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on... https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c... In throughput it's even less of a difference: 2 per cycle vs 3 per cycle. | |
| ▲ | Tuna-Fish an hour ago | parent | prev [-] | | FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster. | | |
| ▲ | pixelesque 38 minutes ago | parent | next [-] | | Only with things like -ffast-math enabled will compilers do the reciprocal.
It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code. | |
| ▲ | mgaunard an hour ago | parent | prev [-] | | That's only valid to do if the reciprocal is representable exactly. |
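To illustrate the point about exactly representable reciprocals, here is a small sketch that counts how often `x / d` and `x * (1.0f / d)` disagree. The helper name `count_reciprocal_mismatches` is hypothetical, and the sketch assumes IEEE 754 single precision evaluated without excess precision and without `-ffast-math`.

```c
#include <stddef.h>

/* Hypothetical helper: for x = 1, 2, ..., n (converted to float), counts
   how many times x / d differs from x * (1.0f / d). */
static size_t count_reciprocal_mismatches(float d, size_t n)
{
    const float reciprocal = 1.0f / d;  /* rounded unless d is a power of two */
    size_t mismatches = 0;

    for (size_t i = 1; i <= n; ++i) {
        float x = (float)i;
        if (x / d != x * reciprocal) {
            ++mismatches;               /* the two roundings disagreed */
        }
    }
    return mismatches;
}
```

With `d = 256.0f` the count is zero, because `1.0f / 256.0f` is exact and scaling by a power of two introduces no rounding in this range. With `d = 255.0f` the reciprocal is rounded, so the count may be non-zero, which is why a compiler cannot make the substitution on its own without a flag such as `-ffast-math`.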
|
|
|
|
| ▲ | dist-epoch 2 hours ago | parent | prev [-] |
| Only in micro-benchmarks. For real usage, today's CPUs are limited by memory bandwidth. |
| |
| ▲ | lacedeconstruct 2 hours ago | parent | next [-] | | What are you talking about? In a hot loop in my software renderer, this is like 10x faster:
// color4_t result = {
// .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
// .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
// .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
// .a = src.a + (dst.a * inv_alpha) * INV_255
// };
// Approximates 1/255 with 1/256 (>> 8), but much faster
color4_t result = {
.r = (src.r * src.a + dst.r * inv_alpha) >> 8,
.g = (src.g * src.a + dst.g * inv_alpha) >> 8,
.b = (src.b * src.a + dst.b * inv_alpha) >> 8,
.a = src.a + ((dst.a * inv_alpha) >> 8)
};
| | |
| ▲ | Tuna-Fish an hour ago | parent | next [-] | | If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies. | |
| ▲ | dist-epoch 2 hours ago | parent | prev [-] | | Because you are working in the cache. Also, you should use SIMD. | | |
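A hedged sketch of what the `>> 8` version trades away, assuming the channels are 8-bit integers and each blended sum stays within 0..255*255 (for instance when `inv_alpha` is `255 - src.a`, which the snippet above does not show). The helper names `div255_shift` and `div255_round` are hypothetical. The second is a commonly used shift-and-add form of a rounded division by 255. It is not the commenter's code.

```c
#include <assert.h>
#include <stdint.h>

/* The approximation from the snippet above: treats /255 as /256. */
static uint8_t div255_shift(uint32_t x)
{
    return (uint8_t)(x >> 8);
}

/* Rounded x / 255 using only adds and shifts.
   Valid for 0 <= x <= 255 * 255, the largest product of two 8-bit values. */
static uint8_t div255_round(uint32_t x)
{
    assert(x <= 255u * 255u);
    uint32_t t = x + 128u;                  /* bias for round-to-nearest */
    return (uint8_t)((t + (t >> 8)) >> 8);  /* t/256 + t/65536 ~ t/255 */
}
```

The visible difference is at full intensity: `div255_shift(255u * 255u)` gives 254, so a fully opaque white pixel never reaches 255, while `div255_round(255u * 255u)` gives 255. Both avoid a divide instruction. Whether the extra add and shift matter in a given hot loop is something to measure.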
| |
|