dudu24 3 hours ago

If you have a ruler and it goes to 12 inches, you should normalize by the length L = 12 and not by 13, the number of points on the ruler.

Timwi an hour ago | parent | next [-]

I'm confused by that analogy. Is the “ruler” a 255-inch ruler with 256 points labeled 0–255, or is it a 256-inch ruler with 256 1-inch segments, making L = 256×1?
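
To make the two readings concrete, here is a minimal sketch in C. The function names `norm_by_255` and `norm_by_256` are hypothetical, chosen only for illustration. Dividing by 255 treats the codes as 256 points spanning a length of 255, so the top code lands exactly on 1.0; dividing by 256 treats each code as the start of a unit segment, so the top code stays just below 1.0.

```c
#include <stdint.h>

/* "256 points, length 255": code 0 maps to 0.0 and code 255 maps to 1.0. */
static float norm_by_255(uint8_t v) { return (float)v / 255.0f; }

/* "256 unit segments, length 256": code 255 maps to 255/256, just below 1.0. */
static float norm_by_256(uint8_t v) { return (float)v / 256.0f; }
```

With these definitions, `norm_by_255(255)` is exactly `1.0f`, while `norm_by_256(255)` is `0.99609375f`, so full white never quite reaches 1.0 under the second convention.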

lacedeconstruct 3 hours ago | parent | prev | next [-]

yes but >> 8 is so much faster

xigoi 2 hours ago | parent | next [-]

You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow.
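
A rough sketch of what that exponent adjustment could look like in C, assuming IEEE-754 single precision (1 sign bit, 8 exponent bits, 23 mantissa bits). The helper name `div256_via_exponent` is made up for this example, and real hardware or compilers will simply issue a multiply instead; the point is only to show that the operation is a subtraction on the exponent field, not a bit shift of the whole value.

```c
#include <stdint.h>
#include <string.h>

/* Divide a float by 256 by subtracting 8 from its biased exponent.
   Assumes IEEE-754 binary32. Hypothetical helper for illustration only. */
static float div256_via_exponent(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* well-defined type punning */
    uint32_t exp = (bits >> 23) & 0xFFu;     /* biased exponent field */
    if (exp == 0xFFu) return x;              /* infinity or NaN: unchanged */
    if (exp <= 8u) return x / 256.0f;        /* zero, subnormal, or underflow: use real division */
    bits -= 8u << 23;                        /* exponent -= 8, i.e. value / 2^8 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, `div256_via_exponent(200.0f)` gives `0.78125f`, the same as `200.0f / 256.0f`, because only the exponent changes and the mantissa is untouched.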

dheera an hour ago | parent [-]

Same point: dividing by a power of 2 is a fast subtraction in float world, while dividing by 255 shits all over the whole float.

StilesCrisis 2 hours ago | parent | prev | next [-]

It's just multiplication. Floating multiply is extraordinarily fast.

lacedeconstruct 2 hours ago | parent [-]

The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable

Sesse__ an hour ago | parent | next [-]

Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 are 3/0.5 cycles for vmulps. No 20 cycles in sight.)

exyi an hour ago | parent | prev | next [-]

It's 3 cycles for float multiplication (and 1 for shift right):

https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...

https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...

In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

Tuna-Fish an hour ago | parent | prev [-]

FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.

pixelesque 38 minutes ago | parent | next [-]

Only with things like -ffast-math enabled will compilers do the reciprocal. It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code.

mgaunard an hour ago | parent | prev [-]

That's only valid to do if the reciprocal is representable exactly.
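
The exactness condition can be checked empirically. The sketch below (the function name `reciprocal_mismatches` is hypothetical) counts how many 8-bit inputs give a different float when divided by `d` than when multiplied by a precomputed `1/d`. The `volatile` qualifiers are there only to force each intermediate to be rounded to single precision.

```c
/* Count 8-bit inputs v for which v / d and v * (1 / d) round to different floats.
   Hypothetical helper for illustration only. */
static int reciprocal_mismatches(float d) {
    volatile float inv = 1.0f / d;            /* reciprocal, rounded to float */
    int mismatches = 0;
    for (int v = 0; v <= 255; ++v) {
        volatile float quotient = (float)v / d;
        volatile float product  = (float)v * inv;
        if (quotient != product) ++mismatches;
    }
    return mismatches;
}
```

For `d = 256.0f` the count is 0, since 1/256 is a power of two and is represented exactly. For `d = 255.0f` the reciprocal is rounded, so the two results are not guaranteed to agree, which is why a compiler will not make that substitution without something like -ffast-math.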

dist-epoch 2 hours ago | parent | prev [-]

Only in micro-benchmarks.

For real usage, today's CPUs are limited by memory bandwidth.

lacedeconstruct 2 hours ago | parent | next [-]

What are you talking about? In a hot loop in my software renderer this is like 10x faster:

    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // 1/256 but much faster
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };

Tuna-Fish an hour ago | parent | next [-]

If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies.
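
For the integer path there is also a middle ground: a well-known shift-and-add identity (often credited to Jim Blinn) that gives the correctly rounded result of dividing by 255 without a divide or a float multiply. The sketch below is not the code from the renderer above; `div255_round`, `div256_trunc`, and `blend_channel` are hypothetical names, and it assumes 8-bit channels with `inv_alpha = 255 - alpha`, so the sum being divided never exceeds 255 * 255.

```c
#include <assert.h>
#include <stdint.h>

/* round(x / 255) for 0 <= x <= 255 * 255, using only adds and shifts. */
static uint8_t div255_round(uint32_t x) {
    assert(x <= 255u * 255u);                 /* identity is only claimed for this range */
    uint32_t t = x + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Truncating x / 256, which is what a plain ">> 8" computes. */
static uint8_t div256_trunc(uint32_t x) { return (uint8_t)(x >> 8); }

/* Blend one channel of src over dst with an 8-bit alpha (assumed layout). */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha) {
    uint32_t inv_alpha = 255u - alpha;
    return div255_round((uint32_t)src * alpha + (uint32_t)dst * inv_alpha);
}
```

The difference shows up at the top of the range: `div256_trunc(255 * 255)` is 254, so opaque white blended over white comes out slightly dark, while `blend_channel(255, 255, 255)` stays at 255.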

dist-epoch 2 hours ago | parent | prev [-]

Because you are working in the cache.

Also, you should use SIMD.

lacedeconstruct 2 hours ago | parent [-]

> Also, you should use SIMD.

Ironically, no: clang is better at auto-vectorizing.

szundi 2 hours ago | parent | prev [-]

[dead]

groundzeros2015 3 hours ago | parent | prev [-]

I’m dumb. Doesn’t 0 start at the beginning?