On Cortex-A8, vsub reads its second source register a cycle earlier than veor, so using it can add one cycle of latency.
Not scalar, but still sub vs xor. Though you’d use vmov immediate for zeroing anyway.
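
For reference, a minimal sketch of the three zeroing idioms being compared, in GNU as syntax. The choice of q0 and the .i32 element size are arbitrary assumptions for illustration:

```
    vmov.i32  q0, #0        @ immediate form: no source registers are read
    veor      q0, q0, q0    @ xor with itself: reads q0 as both sources
    vsub.i32  q0, q0, q0    @ sub from itself: also reads q0 as both sources
```

All three leave q0 zeroed. Only the vmov form has no input registers at all, so it cannot wait on an earlier write to q0.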