So the main issues here are not what people think they are. They generally aren't "suboptimal assembly", at least not what you can reasonably expect out of a C compiler.

The factors are something like:

- specialization: there's already a decent plain-C implementation of the loop, asm/SIMD versions are added on for specific hardware platforms. And different platforms have different SIMD features, so it's hard to generalize them.

- predictability: users have different compiler versions, so even if there is a good one out there not everyone is going to use it.

- optimization difficulties: C's memory model specifically makes optimization difficult here because video is `char *` and `char *` aliases everything. Also, the two kinds of features compilers add for this (intrinsics and autovectorization) can fight each other and make things worse than nothing.

- taste: you could imagine a better portable language for writing SIMD in, but C isn't it. And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything. The assembly is /more/ readable than C would be because it'd all be function calls with names like `_mm_movemask_epi8`.

▲

derf_ 3 days ago | parent | next [-]

One time I spent a week carefully rewriting all of the SIMD asm in libtheora, really pulling out all of the stops to go after every last cycle [0], and managed to squeeze out 1% faster total decoder performance. Then I spent a day reorganizing some structs in the C code and got 7%. I think about that a lot when I decide what optimizations to go after.

[0] https://gitlab.xiph.org/xiph/theora/-/blob/main/lib/x86/mmxl... is an example of what we are talking about here.

▲

saagarjha 3 days ago | parent | next [-]

Unfortunately modern processors do not work how most people think they do. Optimizing for less work for a nebulous idea of what "work" is generally loses to bad memory access patterns or just using better instructions that seem most expensive if you look at them superficially.

	▲	astrange 13 hours ago \| parent [-]
		If you're important enough they'll design the next processor to run your code better anyway. (Or at least add new features specifically for you to adopt.)

▲

magicalhippo 3 days ago | parent | prev [-]

It can be sobering to consider how many instructions a modern CPU can execute in case of a cache miss.

In the timespan of a L1 miss, the CPU could execute several dozen instructions assuming a L2 hit, hundreds if it needs to go to L3.

No wonder optimizing memory access can work wonders.

▲

ack_complete 3 days ago | parent | prev [-]

> And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything.

Wouldn't Intel be the one defining the intrinsics? They're referenced from the ISA manuals, and the Intel Intrinsics Guide regularly references intrinsics like _allow_cpu_features() that are only supported by the Intel compiler and aren't implemented in MSVC.

▲

astrange 3 days ago | parent [-]

The _emm _epi8 stuff is Hungarian notation, which is from Microsoft.

	▲	ack_complete 3 days ago \| parent [-]
		Uh, no, that's standard practice for disambiguating the intrinsic operations for different data types without overloading support. ARM does the same thing with their vector intrinsics, such as vaddq_u8(), vaddq_s16(), etc.