jsenn | 2 days ago
This doesn’t answer your question, but one thing to keep in mind is that past the very first layer, every “token” position is a weighted average of every previous position, so adjacency between positions in later layers isn’t necessarily related to adjacency between input tokens. A borderline tautological answer might be “because the network learns that putting related things next to each other increases the usefulness of the convolutions.”
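To make the "weighted average" point concrete, here is a minimal sketch of one causal self-attention step in plain Python. It is a toy under loud assumptions, not how any particular model is implemented: single head, identity query/key/value projections, no residual connection or MLP, and random input vectors. The function name `causal_attention` and the sizes are made up for illustration.

```python
import math
import random


def causal_attention(x):
    """Toy single-head causal self-attention with identity projections.

    x: list of token vectors (list of lists of floats).
    Returns (outputs, weights), where weights[i][j] is how much position i
    draws from position j. weights[i][j] is 0.0 for every j > i.
    """
    n, d = len(x), len(x[0])
    outputs, weights = [], []
    for i in range(n):
        # Scaled dot-product scores against positions 0..i only (causal mask).
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
        # Output at position i is a weighted average of the visible inputs.
        outputs.append(
            [sum(row[j] * x[j][k] for j in range(n)) for k in range(d)]
        )
    return outputs, weights


random.seed(0)
# Five hypothetical "tokens", each a 4-dimensional random vector.
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
mixed, w = causal_attention(tokens)
for i, row in enumerate(w):
    print(i, [round(v, 2) for v in row])
```

Each printed row sums to 1 and is zero to the right of the diagonal, so position 3 after this layer is already a blend of inputs 0 through 3 (the causal mask lets a position see itself too). Stack a second layer on top and "neighbouring" positions are neighbours of blends, not of the original tokens.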