derf_ | 3 days ago
One time I spent a week carefully rewriting all of the SIMD asm in libtheora, really pulling out all of the stops to go after every last cycle [0], and managed to squeeze out 1% faster total decoder performance. Then I spent a day reorganizing some structs in the C code and got 7%. I think about that a lot when I decide what optimizations to go after.

[0] https://gitlab.xiph.org/xiph/theora/-/blob/main/lib/x86/mmxl... is an example of what we are talking about here.
saagarjha | 3 days ago
Unfortunately, modern processors do not work the way most people think they do. Optimizing for "less work", for some nebulous notion of what work is, generally loses out to fixing bad memory access patterns, or to using better instructions that look more expensive if you only examine them superficially.
magicalhippo | 3 days ago
It can be sobering to consider how many instructions a modern CPU could have executed during a cache miss. In the time it takes to service an L1 miss, the CPU could execute several dozen instructions assuming an L2 hit, and hundreds if it needs to go to L3. No wonder optimizing memory access can work wonders.