krackers 3 hours ago

I think there are two key differences though: 1) Attention doesn't use fixed, distance-dependent weights for the aggregation; instead the weights are "semantically dependent", based on the association between q and k. 2) A single convolution step is a local operation (only pulling from nearby pixels), whereas attention is a "global" operation, pulling from the hidden states of all previous tokens. (Maybe sliding-window attention schemes muddy this distinction, but in general the degree of connectivity seems far higher.)
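To make both differences concrete, here's a rough NumPy sketch (toy shapes and random weight matrices, purely illustrative, not any particular library's API): the convolution weights are indexed only by relative offset and touch only nearby positions, while the attention weights are computed from q/k similarity and range over all previous tokens.

    import numpy as np

    T, d = 8, 16                      # sequence length, hidden size
    x = np.random.randn(T, d)         # token features / hidden states

    # 1) Convolution: fixed weights indexed by relative offset, local support.
    kernel = np.random.randn(3, d, d)        # one weight matrix per offset -1, 0, +1
    conv_out = np.zeros_like(x)
    for t in range(T):
        for j, off in enumerate((-1, 0, 1)): # only nearby positions contribute
            s = t + off
            if 0 <= s < T:
                conv_out[t] += x[s] @ kernel[j]

    # 2) (Causal) attention: weights from q/k similarity, global support.
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)            # semantic association, not distance
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores[~mask] = -np.inf                  # each token sees all previous tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attn_out = weights @ v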

There might be some unifying way to look at these things though, maybe GNNs. I found this talk [1], and at 4:17 it shows how convolution and attention would be modeled in a GNN formalism.

[1] https://www.youtube.com/watch?v=J1YCdVogd14
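Roughly the idea, as I understand it (my own toy sketch, not the talk's notation): both operations are message passing over a graph, and they differ in which graph is used (local grid vs. fully connected) and in how the per-edge weights are chosen (fixed by offset vs. computed from node features).

    import numpy as np

    def message_passing(x, neighbors, edge_weight):
        # x: (N, d) node features; neighbors[i]: source nodes sending messages to node i
        out = np.zeros_like(x)
        for i, nbrs in enumerate(neighbors):
            w = np.array([edge_weight(x, i, j) for j in nbrs])
            w = w / w.sum()
            out[i] = sum(wij * x[j] for wij, j in zip(w, nbrs))
        return out

    N, d = 6, 8
    x = np.random.randn(N, d)

    # Convolution-like: neighbors are adjacent positions, weights depend only on offset.
    conv_nbrs = [[j for j in (i - 1, i, i + 1) if 0 <= j < N] for i in range(N)]
    offset_w = {-1: 0.25, 0: 0.5, 1: 0.25}                  # fixed, distance-dependent
    conv_like = message_passing(x, conv_nbrs, lambda x, i, j: offset_w[j - i])

    # Attention-like: every node attends to every node, weights from feature similarity.
    attn_nbrs = [list(range(N)) for _ in range(N)]
    attn_like = message_passing(x, attn_nbrs,
                                lambda x, i, j: np.exp(x[i] @ x[j] / np.sqrt(d)))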

sifar an hour ago | parent

Nested convolutions and dilated convolutions can both pull in data from farther afield.
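A quick toy example of what that buys you (illustrative 1D code, nothing framework-specific): the kernel stays 3 taps wide, but dilation spaces the taps farther apart, and nesting the convolutions compounds the reach further.

    import numpy as np

    def conv1d(x, kernel, dilation=1):
        # "same"-padded 1D convolution with taps at offsets -dilation, 0, +dilation
        k = len(kernel)
        pad = dilation * (k // 2)
        xp = np.pad(x, pad)
        return np.array([sum(kernel[j] * xp[i + j * dilation] for j in range(k))
                         for i in range(len(x))])

    x = np.zeros(15); x[7] = 1.0              # unit impulse to expose the receptive field
    k = np.ones(3)

    y1 = conv1d(x, k, dilation=1)             # touches positions 6, 7, 8
    y4 = conv1d(x, k, dilation=4)             # touches positions 3, 7, 11
    y_nested = conv1d(conv1d(x, k, 1), k, 4)  # nesting spreads the influence wider still
    print(np.nonzero(y1)[0], np.nonzero(y4)[0], np.nonzero(y_nested)[0])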