Remix.run Logo
visarga 5 days ago

The resemblance is pretty good, they can't show all details because the diagram would be hard to see. But the essential parts are there.

I find the model to be extremely simple, you can write the attention equation on a napkin.

This is the core idea:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

The attention process itself is based on all-to-all similarity calculation Q * K