derf_ 3 days ago

One time I spent a week carefully rewriting all of the SIMD asm in libtheora, really pulling out all of the stops to go after every last cycle [0], and managed to squeeze out 1% faster total decoder performance. Then I spent a day reorganizing some structs in the C code and got 7%. I think about that a lot when I decide what optimizations to go after.

[0] https://gitlab.xiph.org/xiph/theora/-/blob/main/lib/x86/mmxl... is an example of what we are talking about here.

saagarjha 3 days ago

Unfortunately, modern processors do not work the way most people think they do. Optimizing to do "less work", for some nebulous notion of what "work" is, generally loses to fixing bad memory access patterns, or to using better instructions that seem more expensive if you look at them superficially.

astrange 13 hours ago

If you're important enough they'll design the next processor to run your code better anyway.

(Or at least add new features specifically for you to adopt.)

magicalhippo 3 days ago

It can be sobering to consider how many instructions a modern CPU could have executed in the time it spends waiting on a cache miss.

In the timespan of an L1 miss, the CPU could execute several dozen instructions assuming an L2 hit, and hundreds if it needs to go to L3.

No wonder optimizing memory access can work wonders.