If you have a ruler and it goes to 12 inches, you should normalize by the length L and not by 13, the number of points on the ruler.
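To make the ruler point concrete for 8-bit channels, here is a minimal sketch in C. The function names are hypothetical and not taken from any library; the only assumption is that a channel holds values 0 through 255 and should map onto the range 0.0 to 1.0.

```c
#include <assert.h>
#include <stdint.h>

/* Divide by the largest value (the "length" 255): 0 -> 0.0, 255 -> 1.0. */
static float normalize_by_max(uint8_t v) {
    return (float)v / 255.0f;
}

/* Divide by the number of distinct values (256) instead:
   255 maps to 255/256 = 0.99609375, so full intensity never reaches 1.0. */
static float normalize_by_count(uint8_t v) {
    return (float)v / 256.0f;
}

/* Inverse of normalize_by_max, rounding to the nearest 8-bit value. */
static uint8_t to_byte(float x) {
    assert(x >= 0.0f && x <= 1.0f);
    return (uint8_t)(x * 255.0f + 0.5f);
}
```

With the divide-by-255 version, white (255) maps to exactly 1.0 and back again; with divide-by-256 it tops out just below 1.0.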

▲ Timwi an hour ago | parent | next [-]

I'm confused by that analogy. Is the “ruler” a 255-inch ruler with 256 points labeled 0–255, or is it a 256-inch ruler with 256 1-inch segments, making L = 256×1?

▲ lacedeconstruct 3 hours ago | parent | prev | next [-]

yes but >> 8 is so much faster

▲ xigoi 2 hours ago | parent | next [-]

You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow.

▲ dheera an hour ago | parent [-]

Same point: dividing by a power of 2 is a fast exponent subtraction in float world, while dividing by 255 disturbs the whole mantissa.
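As a rough illustration of "subtract 8 from the exponent", here is a sketch in C. It assumes `float` is a 32-bit IEEE 754 single; `div256_exponent` is a hypothetical name, and the fallback to an ordinary division for zero, subnormals, infinities, NaNs, and results that would leave the normal range is one possible way to handle the underflow check, not the only one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Divide a float by 256 by subtracting 8 from its biased exponent field. */
static float div256_exponent(float x) {
    assert(sizeof(float) == sizeof(uint32_t));  /* assumes 32-bit floats */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);             /* read the bit pattern */
    uint32_t exponent = (bits >> 23) & 0xFFu;   /* biased exponent, 8 bits */
    if (exponent <= 8u || exponent == 0xFFu) {
        /* Zero, subnormal input, underflowing result, infinity, or NaN:
           let the ordinary division handle it. */
        return x / 256.0f;
    }
    bits -= 8u << 23;                           /* exponent -= 8 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

In practice a compiler already turns `x / 256.0f` into a multiply by the exactly representable constant 1/256, so the manual version mainly serves to show why a plain `>> 8` on the float's bits would not work.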

▲ StilesCrisis 2 hours ago | parent | prev | next [-]

It's just multiplication. Floating multiply is extraordinarily fast.

▲ lacedeconstruct 2 hours ago | parent [-]

The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable

▲ Sesse__ an hour ago | parent | next [-]

Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., on most modern x86, vmulps is 3/0.5 cycles, i.e. latency/reciprocal throughput. No 20 cycles in sight.)

▲ exyi an hour ago | parent | prev | next [-]

It's 3 cycles for float multiplication (and 1 for shift right):

https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...

https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...

In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

▲ Tuna-Fish an hour ago | parent | prev [-]

FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.

▲ pixelesque 38 minutes ago | parent | next [-]

Only with things like -ffast-math enabled will compilers do the reciprocal. It can make a fair difference in some cases, but it's often better to apply it selectively, by doing it manually in the code locations where you know it is acceptable.

▲ mgaunard an hour ago | parent | prev [-]

That's only valid to do if the reciprocal is representable exactly.
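The representability point can be checked empirically. The sketch below (hypothetical helper name, assuming IEEE 754 single precision and no `-ffast-math`) counts how often `x / d` and `x * (1 / d)` disagree for integer-valued inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Count the integers x in [0, limit] for which x / d and x * (1 / d)
   produce different float results. */
static uint32_t count_reciprocal_mismatches(float d, uint32_t limit) {
    assert(d != 0.0f && limit < UINT32_MAX);
    const float inv = 1.0f / d;        /* rounded unless 1/d is exact */
    uint32_t mismatches = 0;
    for (uint32_t i = 0; i <= limit; ++i) {
        float x = (float)i;
        /* volatile forces each result to be rounded to float precision */
        volatile float quotient = x / d;
        volatile float product = x * inv;
        if (quotient != product) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

For `d = 256.0f` the reciprocal is exactly representable, so the count is 0 and the rewrite is always safe. For `d = 255.0f` the reciprocal is rounded, so the two forms can differ in the last bit for some inputs, which is why a compiler will not make that substitution without a flag that relaxes floating-point semantics.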

▲ dist-epoch 2 hours ago | parent | prev [-]

Only in micro-benchmarks.

For real usage, today's CPUs are limited by memory bandwidth.

▲ lacedeconstruct 2 hours ago | parent | next [-]

What are you talking about? In a hot loop in my software renderer, this is like 10x faster:

    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // Divides by 256 (>> 8) instead of 255: much faster, but (255 * 255) >> 8 == 254
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };
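For comparison with the `>> 8` shortcut: there is a well-known shift-and-add trick that gives the correctly rounded result of dividing by 255 without a divide or a float multiply. The sketch below is not the poster's code; `div255_round` and `blend_channel` are hypothetical names, and it assumes 8-bit channels with `inv_alpha = 255 - alpha`, so the sum never exceeds 255 * 255.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded x / 255 for 0 <= x <= 255 * 255, using only adds and shifts. */
static uint8_t div255_round(uint32_t x) {
    assert(x <= 255u * 255u);
    uint32_t t = x + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Blend one 8-bit channel of src over dst with an 8-bit alpha. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha) {
    uint32_t a = alpha;
    uint32_t inv_alpha = 255u - a;
    return div255_round(src * a + dst * inv_alpha);
}
```

With this version, blending white over anything at full alpha gives 255, whereas `(255 * 255) >> 8` gives 254. Whether the extra add and shift matter in a given hot loop is something only a profile of that loop can answer.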

▲ Tuna-Fish an hour ago | parent | next [-]

If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies.

▲ dist-epoch 2 hours ago | parent | prev [-]

Because you are working in the cache.

Also, you should use SIMD.

▲ lacedeconstruct 2 hours ago | parent [-]

> Also, you should use SIMD.

Ironically, no: clang is better at auto-vectorizing.
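As a sketch of what "let the compiler vectorize it" can look like, here is a blend over a row of 8-bit values written as a plain counted loop with `restrict` pointers. The function name and signature are hypothetical, it uses the `>> 8` shortcut from the code above, and whether a given compiler actually vectorizes it depends on the compiler version, target, and flags (for example `-O2` or `-O3`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend src over dst in place with one constant alpha for the whole row.
   restrict tells the compiler the two buffers do not overlap, which is
   usually needed before it will consider vectorizing the loop. */
static void blend_row(uint8_t *restrict dst, const uint8_t *restrict src,
                      uint8_t alpha, size_t n) {
    assert(dst != NULL && src != NULL);
    const uint32_t a = alpha;
    const uint32_t inv_a = 255u - a;
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (uint8_t)((src[i] * a + dst[i] * inv_a) >> 8);
    }
}
```

Checking the generated assembly, or a vectorization report such as clang's `-Rpass=loop-vectorize`, is the reliable way to see what the compiler did with the loop.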

▲ groundzeros2015 3 hours ago | parent | prev [-]

I’m dumb. Doesn’t 0 start at the beginning?