ActorNightly 7 hours ago

People need to get away from this idea of Key/Query/Value as being special.

Whereas a standard deep layer in a network is matrix * input, where each row of the matrix holds the weights of a particular neuron in the next layer, a transformer layer is basically input*MatrixA, input*MatrixB, input*MatrixC (where the input is the whole matrix of token embeddings, so each product is itself a matrix), and the output then combines those three products, roughly softmax((input*MatrixA)(input*MatrixB)^T)*(input*MatrixC). Just more dimensions in a layer.
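A minimal single-head sketch of that in NumPy (all sizes and names here are made up for illustration):

    import numpy as np

    T, D = 4, 8                        # 4 tokens, embedding dim 8 (toy sizes)
    X = np.random.randn(T, D)          # input: one row per token

    W_A = np.random.randn(D, D)        # "MatrixA" -> queries
    W_B = np.random.randn(D, D)        # "MatrixB" -> keys
    W_C = np.random.randn(D, D)        # "MatrixC" -> values

    Q, K, V = X @ W_A, X @ W_B, X @ W_C    # each one a plain (T, D) matmul

    scores = Q @ K.T / np.sqrt(D)          # (T, T) token-to-token scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    out = weights @ V                      # (T, D): still just matmuls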

And consequently, you can represent the entire transformer architecture as a set of standard deep layers once you unroll the matrices, with a lot of zeros standing in for the multiplication pieces that are not needed.
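A small sketch of one piece of that unrolling (my own illustration, not the full architecture): the three projections fuse into one wide matrix, or equivalently into one block-diagonal matrix padded with zeros.

    import numpy as np

    T, D = 4, 8
    X = np.random.randn(T, D)
    W_A, W_B, W_C = (np.random.randn(D, D) for _ in range(3))

    # Fused form: [W_A | W_B | W_C] is (D, 3D), so one matmul yields Q, K, V.
    Q, K, V = np.split(X @ np.concatenate([W_A, W_B, W_C], axis=1), 3, axis=1)

    # Zero-padded form: a (3D, 3D) block-diagonal matrix, zeros everywhere the
    # blocks shouldn't interact, applied to three stacked copies of the input.
    big = np.zeros((3 * D, 3 * D))
    big[:D, :D], big[D:2*D, D:2*D], big[2*D:, 2*D:] = W_A, W_B, W_C
    Q2, K2, V2 = np.split(np.concatenate([X, X, X], axis=1) @ big, 3, axis=1)
    assert np.allclose(Q, Q2) and np.allclose(K, K2) and np.allclose(V, V2)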

This is a fairly complex blog post, but it shows that it's just matrix multiplication all the way down: https://pytorch.org/blog/inside-the-matrix/

throw310822 7 hours ago | parent

I might be completely off base, but I can't help thinking of convolutions as my mental model for the K/Q/V mechanism. Attention has the same property as a convolution kernel of being trained independently of position; it learns how to translate a large, rolling portion of an input into a new "digested" value; and you can train multiple ones in parallel so that they learn to focus on different aspects of the input ("kernels" in the case of convolution, "heads" in the case of attention).
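A rough sketch of the shared-weights half of that analogy (PyTorch, toy sizes, nothing here is canonical): the conv kernel's weights are reused at every position, and so are attention's projection matrices, with extra heads playing the role of extra kernels.

    import torch

    T, D, H = 6, 8, 2                  # tokens, embedding dim, heads (toy sizes)
    x = torch.randn(1, D, T)           # (batch, channels, length) for the conv

    # Convolution: one kernel, reused at every position along the sequence.
    conv = torch.nn.Conv1d(D, D, kernel_size=3, padding=1)
    y_conv = conv(x)                   # (1, D, T)

    # Attention: one set of Q/K/V projections, also reused at every position;
    # each additional head is like an additional kernel learning its own aspect.
    attn = torch.nn.MultiheadAttention(embed_dim=D, num_heads=H, batch_first=True)
    tokens = x.transpose(1, 2)         # (1, T, D)
    y_attn, _ = attn(tokens, tokens, tokens)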

krackers 3 hours ago | parent

I think there are two key differences though: 1) Attention doesn't use a fixed distance-dependent weight for the aggregation; instead the weight becomes "semantically dependent", based on the association between q and k. 2) A single convolution step is a local operation (only pulling from nearby pixels), whereas attention is a "global" operation, pulling from the hidden states of all previous tokens. (Maybe sliding-window attention schemes muddy this distinction, but in general the degree of connectivity seems far higher.)
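A toy sketch of both differences (NumPy, everything invented): convolution's weight on token j is a fixed number indexed by the offset from i, while attention's weight on j is computed from the content of q_i and k_j, over all previous tokens.

    import numpy as np

    T, D, i = 5, 4, 3
    h = np.random.randn(T, D)              # hidden states for T tokens

    # Convolution: weight depends only on the offset from i; zero outside the window.
    kernel = np.array([0.2, 0.5, 0.3])     # fixed, distance-indexed weights
    conv_out = sum(kernel[k] * h[i - 1 + k] for k in range(3))  # local: j in {i-1, i, i+1}

    # Attention: weight depends on q_i . k_j, and j ranges over all previous tokens.
    Wq, Wk = np.random.randn(D, D), np.random.randn(D, D)
    q, k = h @ Wq, h @ Wk
    scores = q[i] @ k[: i + 1].T / np.sqrt(D)    # "global" (causal): j in {0..i}
    w = np.exp(scores - scores.max())
    w /= w.sum()                                 # semantically-dependent weights
    attn_out = w @ h[: i + 1]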

There might be some unifying way to look at things though, maybe GNNs. I found this talk [1], and at 4:17 it shows how convolution and attention would be modeled in a GNN formalism.

[1] https://www.youtube.com/watch?v=J1YCdVogd14
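I haven't transcribed the talk's notation, but the message-passing template it's gesturing at looks roughly like this (a sketch under my own assumptions): out_i = sum over neighbors j of w(i, j) * h_j, with the two ops differing only in which j count as neighbors and how w(i, j) is computed.

    import numpy as np

    def conv_as_message_passing(h, kernel):
        # Neighbors: fixed local offsets; weights: indexed by offset, not content.
        T, r = len(h), len(kernel) // 2
        out = np.zeros_like(h)
        for i in range(T):
            for k, w in enumerate(kernel):
                j = i + k - r
                if 0 <= j < T:
                    out[i] += w * h[j]
        return out

    def attn_as_message_passing(h, Wq, Wk):
        # Neighbors: every node; weights: softmax over content-based scores.
        q, k = h @ Wq, h @ Wk
        s = q @ k.T / np.sqrt(h.shape[1])
        w = np.exp(s - s.max(axis=1, keepdims=True))
        return (w / w.sum(axis=1, keepdims=True)) @ h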

sifar 2 hours ago | parent

Nested convolutions and dilated convolutions can both pull in data from farther away.
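For instance (a toy check, sizes invented), two stacked dilated convs already let one output position see inputs up to six steps away:

    import torch

    x = torch.zeros(1, 1, 32, requires_grad=True)
    stack = torch.nn.Sequential(
        torch.nn.Conv1d(1, 1, kernel_size=3, padding=2, dilation=2),
        torch.nn.Conv1d(1, 1, kernel_size=3, padding=4, dilation=4),
    )
    stack(x)[0, 0, 16].backward()        # which inputs feed output position 16?
    print((x.grad[0, 0] != 0).nonzero().flatten())  # positions 10..22, step 2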