I would not be surprised if it turned out that the exact attention mechanism does not really matter, much like the move from sigmoid to ReLU to GELU for activation functions, and that what matters is only the speed of computation. QKV attention is pretty good at that on GPUs.
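To make the speed point concrete, here is a minimal sketch of QKV attention, assuming NumPy on the CPU and a single head with no masking or batching. The function name, shapes, and weight matrices are made up for illustration. What it is meant to show is that almost all of the work is a handful of dense matrix multiplies plus a row-wise softmax, which is the kind of workload GPUs handle well.

```python
import numpy as np

def qkv_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (illustrative sketch only).

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    All names and shapes here are assumptions for the example.
    """
    # Three dense projections: plain matrix multiplies.
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    # Pairwise similarity scores, scaled by sqrt(d_head).
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Row-wise softmax, shifted by the row max for numerical stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum of values: one more matrix multiply.
    return weights @ v, weights

# Toy input: 4 tokens, model width 8, head width 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out, weights = qkv_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Swapping in a different scoring function would change the middle step only, so a variant that still reduces to matrix multiplies would keep the same hardware fit.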