slickytail 2 days ago

The relative unimportance of the exact SDPA attention used in modern transformers is already known: https://arxiv.org/abs/2111.11418

The FFN, normalization, and residual connections are absolutely irreplaceable -- but attention can be replaced with almost any other layer that shares information between tokens, such as pooling, convolution, random mixing, etc.
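
To illustrate the point, here is a minimal sketch (my own illustration, not code from the linked paper) of a MetaFormer-style block where the token mixer is a pluggable slot: SDPA attention and simple average pooling both fit it, while the norms, FFN, and residuals stay fixed. All class and variable names are made up for the example.

  import torch
  import torch.nn as nn

  class PoolMixer(nn.Module):
      """Shares information between neighbouring tokens via average pooling."""
      def __init__(self, kernel_size: int = 3):
          super().__init__()
          self.pool = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2)

      def forward(self, x):                          # x: (batch, seq, dim)
          return self.pool(x.transpose(1, 2)).transpose(1, 2)

  class AttentionMixer(nn.Module):
      """Standard SDPA self-attention, for comparison."""
      def __init__(self, dim: int, num_heads: int = 4):
          super().__init__()
          self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

      def forward(self, x):
          out, _ = self.attn(x, x, x, need_weights=False)
          return out

  class Block(nn.Module):
      """Norm -> token mixer -> residual, then norm -> FFN -> residual."""
      def __init__(self, dim: int, mixer: nn.Module):
          super().__init__()
          self.norm1 = nn.LayerNorm(dim)
          self.mixer = mixer                          # the swappable part
          self.norm2 = nn.LayerNorm(dim)
          self.ffn = nn.Sequential(
              nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
          )

      def forward(self, x):
          x = x + self.mixer(self.norm1(x))           # token mixing (replaceable)
          x = x + self.ffn(self.norm2(x))             # channel mixing (kept)
          return x

  x = torch.randn(2, 16, 64)                          # (batch, tokens, dim)
  print(Block(64, AttentionMixer(64))(x).shape)       # torch.Size([2, 16, 64])
  print(Block(64, PoolMixer())(x).shape)              # same shape, no attention

The only line that changes between the two variants is which mixer gets passed in; everything the paper finds indispensable (norms, FFN, residuals) is untouched.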

cztomsik a day ago

Hm, the residual is the one I would not expect -- can you elaborate on why?

simsla a day ago

Residual connections avoid vanishing gradients in deeper networks.

Also, most blocks with a residual approximate the identity function when initialised, so they tend to be well behaved.
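
A small sketch of both points (my illustration, using the common trick of zero-initialising the last layer of the branch so the block is exactly the identity at init; names are made up):

  import torch
  import torch.nn as nn

  class ResidualFFN(nn.Module):
      def __init__(self, dim: int):
          super().__init__()
          self.norm = nn.LayerNorm(dim)
          self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                   nn.Linear(4 * dim, dim))
          nn.init.zeros_(self.ffn[-1].weight)   # zero-init the branch output...
          nn.init.zeros_(self.ffn[-1].bias)     # ...so x + ffn(norm(x)) == x at init

      def forward(self, x):
          return x + self.ffn(self.norm(x))     # skip path always preserves x

  x = torch.randn(2, 16, 64, requires_grad=True)
  block = ResidualFFN(64)
  print(torch.allclose(block(x), x))            # True: block is identity at init

  # The skip path keeps each block's Jacobian close to the identity, so the
  # gradient does not shrink even through a deep stack of 50 blocks.
  deep = nn.Sequential(*[ResidualFFN(64) for _ in range(50)])
  deep(x).sum().backward()
  print(x.grad.abs().mean())                    # ~1.0, not vanishingly small

Without the skip connections, the gradient would have to pass through every branch in the stack and could easily shrink toward zero.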