If you have access to the BMI2 instruction set I can do branchless UTF-8 encoding like in the article using only 9 instructions and 73 bytes of lookup tables:

    branchless_utf8:
        mov     rax, rdi
        lzcnt   ecx, esi
        lea     rdx, [rip + .L__unnamed_1]
        movzx   ecx, byte ptr [rcx + rdx]
        lea     rdx, [rip + example::DEP_AND_OR::h78cbe1dc7fe823a9]
        pdep    esi, esi, dword ptr [rdx + 8*rcx]
        or      esi, dword ptr [rdx + 8*rcx + 4]
        movbe   dword ptr [rdi], esi
        mov     qword ptr [rdi + 8], rcx
        ret

The code:

    static DEP_AND_OR: [(u32, u32); 5] = [
        (0, 0),
        (0b01111111_00000000_00000000_00000000, 0b00000000_00000000_00000000_00000000),
        (0b00011111_00111111_00000000_00000000, 0b11000000_10000000_00000000_00000000),
        (0b00001111_00111111_00111111_00000000, 0b11100000_10000000_10000000_00000000),
        (0b00000111_00111111_00111111_00111111, 0b11110000_10000000_10000000_10000000),
    ];

    const LEN: [u8; 33] = [
        // 0-10 leading zeros: not valid.
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        // 11-15 leading zeros: 4 bytes.
        4, 4, 4, 4, 4,
        // 16-20 leading zeros: 3 bytes.
        3, 3, 3, 3, 3,
        // 21-24 leading zeros: 2 bytes.
        2, 2, 2, 2,
        // 25-32 leading zeros: 1 byte.
        1, 1, 1, 1, 1, 1, 1, 1,
    ];

    pub unsafe fn branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
        let leading_zeros = codepoint.leading_zeros() as usize;
        let bytes = LEN[leading_zeros] as usize;
        let (mask, or) = *DEP_AND_OR.get_unchecked(bytes);
        let ret = core::arch::x86_64::_pdep_u32(codepoint, mask) | or;
        (ret.swap_bytes().to_le_bytes(), bytes)
    }

▲ adrian_b 6 months ago | parent | next [-]

While Intel has LZCNT and TZCNT (leading-zero count and trailing-zero count), which replace the wrongly-defined BSR and BSF, only since Haswell (June 2013), AMD has LZCNT since Barcelona (September 2007) and TZCNT since Piledriver (May 2012).

The author has made the mistake of not using the right compilation options for the CPU, in order to enable the use of LZCNT and TZCNT, because it is very likely that the author uses a CPU that supports these instructions, unless it is an older Intel Atom CPU, up to Tremont.

Had the author compiled correctly the program, there should not have been any branches since the beginning.

When Intel has added the BSF and BSR instructions in 1985 to 80386, they have made a very serious mistake in their definition, despite the fact that they should have followed the example of much older ISAs, where these instructions were defined correctly.

AMD has defined LZCNT and TZCNT in order to correct Intel's mistake, but in order to ensure backward compatibility, the corrected instructions use an additional prefix that is ignored by older CPUs, instead of using a new encoding. This makes the encoding of these instructions much longer than it should be.

▲

hn3er1q 6 months ago | parent [-]

Interesting. What compiler options would you have used? Do you know if the options are applicable for ARM as well?

	▲	adrian_b 6 months ago \| parent [-]
		"-march=...", e.g. "-march=skylake" or "-march=znver3" or whatever CPU you are using. When you do not know the correct type, and you do not cross-compile, you can just use "-march=native". I recommend to never use either gcc or clang without giving an explicit "-march=..." option. Otherwise you do not know which is the default compilation target and it is almost certain that the quality of the generated code will be bad. Even for code that will be distributed in binary form for multiple computers, you must choose consciously a compilation target that is the minimum that should be supported and you must not depend on a random default choice that may change at each compiler update. For ARM, there are similar options for specifying the CPU model, e.g. "-march=armv8.2-a". However there are much fewer models and counting leading zeroes has been supported for about 20 years, so even the default ARM CPU model must include it. On ARM only the compiler options for specifying the floating-point instruction subset can be tricky (an example of such an option: "-mfpu=fpv4-sp-d16"), when you are targetting embedded computers or microcontrollers, because there you can still encounter a lot of older ARM cores that have a much more restricted floating-point support than in more recent models. Moreover, some vendors of ARM-based devices may choose to reduce the cost by disabling some floating-point features, so knowing the name of an ARM core may be not enough for the selection of the compiler options, you must also know whether the optional core synthesis options have been enabled or disabled.

▲ 6 months ago | parent | prev [-]

[deleted]