visarga | 5 days ago
The resemblance is pretty good; they can't show every detail because the diagram would become hard to read, but the essential parts are there. I find the model extremely simple: you can write the attention equation on a napkin. This is the core idea:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

The attention process itself is based on an all-to-all similarity calculation, Q * K^T.
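The napkin equation translates almost line-for-line into code. Here's a minimal NumPy sketch of scaled dot-product attention (function and variable names are my own, just for illustration):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # all-to-all similarity between queries and keys
    # numerically stable softmax over each row of scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a weighted mix of the value vectors

# toy example: 3 queries attending over 4 key/value pairs, d_k = 2
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 2))
K = rng.normal(size=(4, 2))
V = rng.normal(size=(4, 2))
out = attention(Q, K, V)
print(out.shape)  # (3, 2): one output vector per query
```

Note the Q @ K.T product is where the quadratic cost comes from: every query is compared against every key.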