> so instead it's piecemeal implementations mostly in numeric packages like eigen and lapack.
Because that’s where the user-noticeable gains can be made. Using popcount in code you run once is going to shave off, maybe, 100 cycles. That isn’t worth the extra cycles of that approach.
Also, FTA: “and arguably the whole scheme should be replaced by finer-grained feature detection”. Such feature detection would lead to a combinatorial explosion of different binaries.
Finally, where it really matters, it’s not only a matter of recompiling the same code. For optimal performance, you also want to change loop unrolling strategy, stride count, etc.