dzaima 20 hours ago
Intel still shares ports between vector and scalar on P-cores; a scalar multiply in the loop will definitely fight vector ops for a shared port, and the bits of pointer bumps and branch and whatnot can fill up the 1 or 2 scalar-only ports. And maybe there are some minor power savings from not wasting resources on the scalar overhead. Still, clang does unroll way too much.
Remnant44 19 hours ago
My understanding is that they've changed this for Lion Cove and all future P-cores, moving to much more of a Zen-like setup with separate schedulers and ports for vector and scalar ops.