| ▲ | profsummergig 8 hours ago |
| Haven't watched it yet... ...but, if you have favorite resources on understanding Q & K, please drop them in comments below... (I've watched the Grant Sanderson/3blue1brown videos [including his excellent talk at TNG Big Tech Day '24], but Q & K still escape me). Thank you in advance. |
|
| ▲ | roadside_picnic 7 hours ago | parent | next [-] |
| It's just a re-invention of kernel smoothing. Cosma Shalizi has an excellent write-up on this [0]. Once you recognize this it's a wonderful re-framing of what a transformer is doing under the hood: you're effectively learning a bunch of sophisticated kernels (through the FF part) and then applying kernel smoothing in different ways through the attention layers. It makes you realize that Transformers are philosophically much closer to things like Gaussian Processes (which are also just a bunch of kernel manipulation). 0. http://bactra.org/notebooks/nn-attention-and-transformers.ht... |
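To make the kernel-smoothing framing concrete, here is a minimal NumPy sketch (not taken from the linked write-up). It assumes the kernel is an exponentiated, scaled dot product; the names `kernel_smooth`, `exp_dot_kernel` and `softmax_attention` are made up for illustration. Under that assumption, a Nadaraya-Watson kernel-weighted average and the usual softmax attention formula should give the same numbers.

```python
import numpy as np

def kernel_smooth(queries, keys, values, kernel):
    """Nadaraya-Watson style estimate: a kernel-weighted average of `values`."""
    weights = kernel(queries, keys)                        # (n_q, n_k), non-negative
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ values

def exp_dot_kernel(queries, keys):
    """Assumed kernel: exponentiated scaled dot product."""
    d = queries.shape[-1]
    return np.exp(queries @ keys.T / np.sqrt(d))

def softmax_attention(queries, keys, values):
    """Scaled dot-product attention written the usual way."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs @ values

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))   # 3 query points
k = rng.normal(size=(5, 4))   # 5 "training inputs"
v = rng.normal(size=(5, 2))   # 5 "training targets"

smoothed = kernel_smooth(q, k, v, exp_dot_kernel)
attended = softmax_attention(q, k, v)
same = np.allclose(smoothed, attended)
```

Swapping `exp_dot_kernel` for, say, a Gaussian kernel on distances would give a different smoother with the same overall structure, which is roughly the sense in which attention is one member of a larger family.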
|
| ▲ | leopd 8 hours ago | parent | prev | next [-] |
| I think this video does a pretty good job explaining it, starting about 10:30 minutes in: https://www.youtube.com/watch?v=S27pHKBEp30 |
| |
| ▲ | andoando 7 hours ago | parent | next [-] | | This wasn't any better than the other explanations I've seen. | |
| ▲ | oofbey 8 hours ago | parent | prev [-] | | As the first comment says "This aged like fine wine". Six years old, but the fundamentals haven't changed. |
|
|
| ▲ | throw310822 7 hours ago | parent | prev | next [-] |
| Have you tried asking e.g. Claude to explain it to you? None of the usual resources worked for me, until I had a discussion with Claude where I could ask questions about everything that I didn't get. |
| |
|
| ▲ | machinationu 6 hours ago | parent | prev | next [-] |
| Q, K and V are a way of filtering the relevant aspects for the task at hand from the token embeddings. "he was red" - maybe color, maybe angry, the "red" token embedding carries both, but only one aspect is relevant for some particular prompt. https://ngrok.com/blog/prompt-caching/ |
|
| ▲ | red2awn 8 hours ago | parent | prev | next [-] |
| Implement transformers yourself (i.e. in NumPy). You'll never truly understand it by just watching videos. |
| |
| ▲ | D-Machine 8 hours ago | parent | next [-] | | Seconding this, the terms "Query" and "Value" are largely arbitrary and meaningless in practice, look at how to implement this in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and self-attention is always just self_attention(x, x, x) or self_attention(x, x, y) in some cases, where x and y are outputs from previous layers. Plus with different forms of attention, e.g. merged attention, and the research into why / how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" thing starts to look really bogus. Really it is that the attention layer allows for modeling correlations and/or multiplicative interactions among a dimension-reduced representation. | | |
| ▲ | tayo42 an hour ago | parent | next [-] | | >the terms "Query" and "Value" are largely arbitrary and meaningless in practice This is the most confusing thing about it imo. Those words all mean something but they're just more matrix multiplications. Nothing was being searched for. | | |
| ▲ | D-Machine 29 minutes ago | parent [-] | | Better resources will note the terms are just historical and not really relevant anymore, and just remain a naming convention for self-attention formulas. IMO it is harmful to learning and good pedagogy to say they are anything more than this, especially as we better understand the real thing they are doing is approximating feature-feature correlations / similarity matrices, or perhaps even more generally, just allow for multiplicative interactions (https://openreview.net/forum?id=rylnK6VtDH). |
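As a rough illustration of the point in this sub-thread, here is a single-head self-attention sketch in NumPy. The function name `self_attention`, the weight names `w_q`, `w_k`, `w_v` and the sizes are assumptions chosen for the example, not any library's API. It shows that "query", "key" and "value" are three linear projections of the same input, and that the score matrix is a bilinear (multiplicative) form in `x`.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: all three projections read the same x."""
    q = x @ w_q                               # (n_tokens, d_head)
    k = x @ w_k                               # (n_tokens, d_head)
    v = x @ w_v                               # (n_tokens, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_tokens, n_tokens) similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights, scores

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 3           # d_head < d_model: dimension reduction
x = rng.normal(size=(n_tokens, d_model))
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))

out, weights, scores = self_attention(x, w_q, w_k, w_v)

# The scores are x M x^T with M = w_q w_k^T: a multiplicative interaction
# between token features, with nothing being "looked up".
bilinear = x @ (w_q @ w_k.T) @ x.T / np.sqrt(d_head)
scores_are_bilinear = np.allclose(scores, bilinear)
```

Since `w_q @ w_k.T` is a `d_model x d_model` matrix of rank at most `d_head`, the projections act as a low-rank factorization of one feature-feature interaction matrix.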
| |
| ▲ | profsummergig 8 hours ago | parent | prev [-] | | Do you think the dimension reduction is necessary? Or is it just practical (due to current hardware scarcity)? | | |
| ▲ | D-Machine 40 minutes ago | parent [-] | | Definitely mostly just a practical thing IMO, especially with modern attention variants (sparse attention, FlashAttention, linear attention, merged attention etc). Not sure it is even hardware scarcity per se / solely, it would just be really expensive in terms of both memory and FLOPs (and not clearly increase model capacity) to use larger matrices. Also for the specific part where you, in code for encoder-decoder transformers, call the a(x, x, y) function instead of the usual a(x, x, x) attention call (what Alammar calls "encoder-decoder attention" in his diagram just before the "The Decoder Side"), you have different matrix sizes, so dimension reduction is needed to make the matrix multiplications work out nicely too. But in general it is just a compute thing IMO. |
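A small sketch of the encoder-decoder case mentioned above, under stated assumptions: queries come from one sequence `x` and keys/values from another sequence `y`, and the two sequences are deliberately given different feature widths to show the projections making the shapes line up. The function name `cross_attention` and all sizes are hypothetical; the argument order of the comment's a(x, x, y) is not specified there, so this is only one plausible reading.

```python
import numpy as np

def cross_attention(x, y, w_q, w_k, w_v):
    """Queries from x, keys and values from y (single head)."""
    q = x @ w_q                                # (n_x, d_head)
    k = y @ w_k                                # (n_y, d_head)
    v = y @ w_v                                # (n_y, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_x, n_y)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                         # (n_x, d_head)

rng = np.random.default_rng(0)
n_x, d_x = 3, 6      # e.g. decoder-side tokens (assumed sizes)
n_y, d_y = 5, 10     # e.g. encoder-side tokens (assumed sizes)
d_head = 4           # shared projected width

x = rng.normal(size=(n_x, d_x))
y = rng.normal(size=(n_y, d_y))
w_q = rng.normal(size=(d_x, d_head))
w_k = rng.normal(size=(d_y, d_head))
w_v = rng.normal(size=(d_y, d_head))

out = cross_attention(x, y, w_q, w_k, w_v)
```

Without the projections, `x @ y.T` would not even be defined here because the inner dimensions (6 and 10) differ; mapping both to `d_head` is what lets the score matrix exist.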
|
| |
| ▲ | roadside_picnic 6 hours ago | parent | prev | next [-] | | I personally don't think implementation is as enlightening for really understanding what the model is doing as this statement implies. I had done that many times, but it wasn't until reading about the relationship to kernel methods that it really clicked for me what is really happening under the hood. Don't get me wrong, implementing attention is still great (and necessary), but even with something as simple as linear regression, implementing it doesn't really give you the entire conceptual model. I do think implementation helps to understand the engineering of these models, but it still requires reflection and study to start to understand conceptually why they are working and what they're really doing (I would, of course, argue I'm still learning about linear models in that regard!) | |
| ▲ | krat0sprakhar 7 hours ago | parent | prev [-] | | Do you have a tutorial that I can follow? | | |
| ▲ | jwitthuhn 5 hours ago | parent | next [-] | | If you have 20 hours to spare I highly recommend this youtube playlist from Andrej Karpathy
https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb... It starts with the fundamentals of how backpropagation works then advances to building a few simple models and ends with building a GPT-2 clone. It won't teach you everything about AI models but it gives you a solid foundation for branching out. | |
| ▲ | roadside_picnic 6 hours ago | parent | prev [-] | | The most valuable tutorial will be translating from the paper itself. The more hand holding you have in the process, the less you'll be learning conceptually. The pure manipulation of matrices is rather boring and uninformative without some context. I also think the implementation is more helpful for understanding the engineering work to run these models than getting a deeper mathematical understanding of what the model is doing. |
|
|
|
| ▲ | bobbyschmidd 7 hours ago | parent | prev [-] |
| tldr: recursively aggregating packing/unpacking 'if else if (functions)/statements' as keyword arguments that (call)/take them themselves as arguments, with their own position shifting according to the number "(weights)" of else if (functions)/statements needed to get all the other arguments into (one of) THE adequate orders. the order changes based on the language, input prompt and context. if I understand it all correctly. implemented it in html a while ago and might do it in htmx sometime soon. transformers are just slutty dictionaries that Papa Roach and kage bunshin no jutsu right away again and again, spawning clones and variations based on requirements, which is why they tend to repeat themselves rather quickly and often. it's got almost nothing to do with languages themselves and requirements and weights amount to playbooks and DEFCON levels |