derf_ 3 days ago

One time I spent a week carefully rewriting all of the SIMD asm in libtheora, really pulling out all of the stops to go after every last cycle [0], and managed to squeeze out 1% faster total decoder performance. Then I spent a day reorganizing some structs in the C code and got 7%. I think about that a lot when I decide what optimizations to go after.

[0] https://gitlab.xiph.org/xiph/theora/-/blob/main/lib/x86/mmxl... is an example of what we are talking about here.

saagarjha 3 days ago

Unfortunately, modern processors do not work the way most people think they do. Optimizing to do "less work", for some nebulous notion of what "work" is, generally loses to fixing bad memory access patterns, or to using better instructions that seem more expensive if you look at them superficially.

astrange 13 hours ago

If you're important enough they'll design the next processor to run your code better anyway.

(Or at least add new features specifically for you to adopt.)

magicalhippo 3 days ago

It can be sobering to consider how many instructions a modern CPU could have executed in the time it spends waiting on a cache miss.

In the timespan of an L1 miss, the CPU could execute several dozen instructions assuming an L2 hit, and hundreds if it needs to go to L3.

No wonder optimizing memory access can work wonders.