ack_complete 3 hours ago
There are also mode-switching and calling-convention issues. The way the vector registers were extended to 256 bits causes problems when legacy 128-bit ops and 256-bit ops are mixed. Doing so puts the CPU into a mode where every legacy 128-bit op is forced to blend in the high half, which can cut the throughput of existing SSE2-based library routines to as little as 1/4 of normal. For this reason, AVX code has to use the VZEROUPPER instruction aggressively, so that the CPU is not left in 256-bit vector mode when control might return to any library or external code that uses SSE2. VZEROUPPER sets a flag to zero the high half of all 256-bit registers, so it's cheap on modern x86 CPUs but can be expensive to emulate without hardware support.

The other problem is that only the low 128 bits of vector registers are preserved across function calls, due to the Windows x64 calling convention and the VZEROUPPER issue. In practice, this means that any call to external code forces the compiler to spill all live 256-bit vectors to memory. Ideally, 256-bit vector usage is concentrated in leaf routines so this isn't an issue, but in non-leaf routines it can result in a lot of memory traffic.
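To make the leaf-routine advice concrete, here is a minimal sketch in C with intrinsics. The function `sum_floats` is a made-up example, not taken from any library. It keeps all 256-bit work inside one leaf function and clears the upper halves before returning. It assumes the AVX path is enabled at compile time (for example with `-mavx`) and falls back to scalar code otherwise. Compilers targeting AVX generally insert VZEROUPPER on their own at function boundaries, so the explicit `_mm256_zeroupper()` call is mainly there to show where it belongs.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical leaf routine: sums n floats. It makes no calls while
   256-bit values are live, so nothing has to be spilled around a call. */
float sum_floats(const float *data, size_t n)
{
    float total = 0.0f;
    size_t i = 0;

#if defined(__AVX__)
    /* 256-bit accumulator: eight partial sums at a time. */
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));

    /* Fold the 256-bit accumulator down to a single scalar. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    total = _mm_cvtss_f32(s);

    /* Zero the upper halves so that legacy SSE code run afterwards
       does not pay the mixed-mode penalty. */
    _mm256_zeroupper();
#endif

    /* Scalar tail (or the whole array when AVX is not enabled). */
    for (; i < n; ++i)
        total += data[i];

    return total;
}
```

Calling an external function from inside the vector loop would change the picture: the compiler would then have to save `acc` to the stack around each call, which is the memory traffic described above.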