| ▲ | mbowcut2 5 days ago |
| The problem with embeddings is that they're basically inscrutable to anything but the model itself. It's true that they must encode the semantic meaning of the input sequence, but the learning process compresses it to the point that only the model's learned decoder head knows what to do with it. Anthropic has developed interpretable internal features for Sonnet 3 [1], but from what I understand that requires somewhat expensive parallel training of a network whose sole purpose is to attempt to disentangle LLM hidden-layer activations. [1] https://transformer-circuits.pub/2024/scaling-monosemanticit... |
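To make the "parallel network" idea concrete, here is a minimal sketch of the kind of sparse autoencoder used for this sort of feature extraction; the dimensions, the L1 penalty, and the training setup are illustrative assumptions, not Anthropic's actual configuration.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Dictionary-learning model trained on hidden-layer activations to
        recover (hopefully) interpretable features. Sizes are made up."""
        def __init__(self, d_model=4096, d_features=65536):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, activations):
            features = torch.relu(self.encoder(activations))  # sparse feature activations
            return self.decoder(features), features

    sae = SparseAutoencoder()
    acts = torch.randn(8, 4096)          # stand-in batch of LLM hidden-layer activations
    recon, feats = sae(acts)
    # Reconstruct the activation while pushing most feature activations toward zero.
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()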
|
| ▲ | spmurrayzzz 5 days ago | parent | next [-] |
| Very much agree re: inscrutability. It gets even more complicated when you add the LLM-specific concept of rotary positional embeddings to the mix. In my experience, it's been exceptionally hard to communicate that concept even to technical folks who may understand (at a high level) the concept of semantic similarity via something like cosine distance. I've come up with so many failed analogies at this point that I've lost count (the concept of fast and slow clocks representing the positional index / angular rotation has been the closest I've come so far). |
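For what it's worth, the clock analogy can be made concrete with a toy sketch like the one below: each pair of dimensions is rotated by an angle proportional to the token's position, with some pairs spinning fast and others slowly. This is a bare-bones NumPy illustration of the idea, not any particular library's RoPE implementation.

    import numpy as np

    def rotary_embed(x, position, base=10000.0):
        """Rotate pairs of dimensions of x by an angle that depends on position.
        Low pairs spin fast, high pairs spin slowly -- the fast/slow clocks."""
        d = x.shape[-1]
        freqs = base ** (-2.0 * np.arange(d // 2) / d)   # one frequency per pair
        angles = position * freqs
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[0::2], x[1::2]
        out = np.empty_like(x)
        out[0::2] = x1 * cos - x2 * sin                  # standard 2D rotation per pair
        out[1::2] = x1 * sin + x2 * cos
        return out

    q = np.random.randn(8)
    print(rotary_embed(q, position=0))   # position 0: no rotation at all
    print(rotary_embed(q, position=5))   # same vector, each pair rotated 5x its frequency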
| |
| ▲ | krackers 4 days ago | parent [-] | | I've read that "No Position Embedding" seems to be better for long-context performance anyway, so it's probably not something essential to explain. | | |
| ▲ | spmurrayzzz 4 days ago | parent [-] | | Do you have a citation for the paper on that? IME, that's not really something you see used in practice, at least not after 2022 or so. Without some form of positional adjustment, transformer-based LLMs have no way to differentiate between "The dog bit the man." and "The man bit the dog." given that the two sentences contain exactly the same token ids. You just end up back in the bag-of-words problem space. The self-attention mechanism is permutation-invariant, so as long as the attention scores are computed over an unordered set, you need some other way to model the sequence order. Long context is almost always some form of RoPE in practice (often YaRN these days). We can't confirm this with the closed-source frontier models, but given that all the long-context models in the open-weight domain are absolutely encoding positional data, coupled with the fact that the majority of recent and past literature corroborates its use, we can be reasonably sure they're using some form of it there as well. EDIT: there is a recent paper that addresses the sequence-modeling problem in another way, but it's somewhat orthogonal to the above as they're changing the tokenization method entirely: https://arxiv.org/abs/2507.07955 | | |
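A toy demonstration of the permutation point, if it helps: with no positional signal, shuffling the input tokens just shuffles the attention output, so nothing about the original order survives. The identity Q/K/V projections here are purely for brevity.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    d = 16
    tokens = torch.randn(5, d)             # 5 token embeddings, no positional encoding
    perm = torch.randperm(5)

    def self_attention(x):
        scores = (x @ x.T) / d ** 0.5      # identity projections, single head
        return F.softmax(scores, dim=-1) @ x

    out = self_attention(tokens)
    out_shuffled = self_attention(tokens[perm])
    # Shuffling inputs just shuffles outputs: order carries no information.
    print(torch.allclose(out[perm], out_shuffled, atol=1e-6))   # True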
| ▲ | krackers 3 days ago | parent [-] | | The paper showing that dropping positional encoding entirely is feasible is https://arxiv.org/pdf/2305.19466 . But I was misremembering its long-context performance; Llama 4 does use NoPE, but it's still interleaved with RoPE layers. Just an armchair commenter though, so I may well be wrong. My intuition for NoPE was that the presence of the causal mask provides enough of a signal to implicitly distinguish token position. If you imagine the flow of information in the transformer network, tokens later in the sequence "absorb" information from the hidden states of previous tokens, so in this sense information flows "down (depth) and to the right (token position)", and you could imagine the network learning a scheme that uses this property to encode position. | | |
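Roughly, the asymmetry looks like this: with a causal mask, token i attends only to positions 0..i, so each token sees a different number of predecessors, and that asymmetry is itself a weak positional signal a network could in principle learn to exploit. This is just a toy illustration of the mask, not the NoPE paper's construction.

    import torch

    seq_len = 5
    # Causal mask: row i can attend only to columns 0..i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    print(mask.int())
    # Each token attends to a different number of predecessors (1, 2, 3, ...),
    # so position leaks in even without explicit position embeddings.
    print(mask.sum(dim=-1))    # tensor([1, 2, 3, 4, 5])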
| ▲ | spmurrayzzz 3 days ago | parent [-] | | Ah, didn't realize you were referring to NoPE explicitly. And yeah, the intuitions gained from that paper are pretty much what I alluded to above: you don't get away with never modeling the positional data; the question is how you model it effectively and where you derive that signal from. NoPE never really took off more broadly in modern architecture implementations, and we haven't seen anyone successfully reproduce the proposed solution to the long-context problem presented in the paper (tuning the scaling factor in the attention softmax). There was a recent paper back in December [1] that talked about the idea of positional information arising from the similarity of nearby embeddings. It's again in that common research bucket of "never reproduced", but interesting. It does sound similar in spirit, though, to the NoPE idea you mentioned of the causal mask providing some amount of position signal, i.e. we don't necessarily need to adjust the embeddings explicitly for the same information to be learned (TBD on whether that proves out long term). This all goes back to my original comment, though, about communicating this idea to AI/ML neophytes being challenging. I don't think skipping the concept of positional information actually makes these systems easier to comprehend, since it's critically important to how we model language, but it's also really complicated to explain in terms of implementation. [1] https://arxiv.org/abs/2501.00073 |
|
|
|
|
|
| ▲ | gbacon 5 days ago | parent | prev | next [-] |
| I found decent results using multiclass spectral clustering to query embedding spaces. https://ieeexplore.ieee.org/document/10500152 https://ieeexplore.ieee.org/document/10971523 |
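In case it's useful: a rough sketch of the general approach using scikit-learn's off-the-shelf spectral clustering over a cosine-affinity matrix. The embedding source and cluster count are placeholders, and this is not the exact method from the linked papers.

    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.metrics.pairwise import cosine_similarity

    # Stand-in for real sentence/document embeddings.
    embeddings = np.random.randn(200, 384)

    affinity = cosine_similarity(embeddings)
    affinity = np.clip(affinity, 0, None)          # affinities must be non-negative

    labels = SpectralClustering(
        n_clusters=8, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    print(labels[:10])                             # cluster id per embedding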
|
| ▲ | kianN 4 days ago | parent | prev | next [-] |
| This is exactly the challenge. When embeddings were first popularized with word2vec they were interpretable, because the word2vec model was revealed to be an implicit matrix factorization [1]. LLM embeddings are so abstract and far removed from any human-interpretable or statistical corollary that even as the embeddings contain more information, that information becomes less accessible to humans. [1] https://papers.nips.cc/paper_files/paper/2014/hash/b78666971... |
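As a contrast, the classic word2vec-style interpretability is easy to show with gensim's pretrained vectors (the model name below is one of gensim's downloadable options; any word2vec KeyedVectors would do): directions in the space line up with human concepts in a way LLM hidden states generally don't.

    import gensim.downloader as api

    wv = api.load("word2vec-google-news-300")      # pretrained word2vec vectors
    # "king" - "man" + "woman" lands near "queen": the geometry is legible to humans.
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))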
|
| ▲ | gavmor 4 days ago | parent | prev | next [-] |
| > learned decoder head That's a really interesting three-word noun-phrase. Is it a term-of-art, or a personal analogy? |
|
| ▲ | TZubiri 5 days ago | parent | prev | next [-] |
| Can't you decode the embeddings to tokens for debugging? |
| |
| ▲ | freeone3000 5 days ago | parent [-] | | You can, but it's lossy (it drops context; it's a dimensionality reduction from 512 or 1024 dimensions down to a few bytes) and not reversible. |
|
|
| ▲ | samrus 5 days ago | parent | prev | next [-] |
| I mean, that's true for all DL layers, but we talk about convolutions and stuff often enough. Embeddings are relatively new, but there's not a lot of discussion of how crazy they are, especially given that they're the real star of the LLM, with transformers being a close second imo |
|
| ▲ | visarga 5 days ago | parent | prev [-] |
| You can search for the closest matching words or expressions in a dictionary. It's trivial to understand where an embedding points. |
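Something like this, assuming you have the vocabulary's embedding matrix on hand (the names and data here are placeholders):

    import numpy as np

    def nearest_words(query_vec, vocab_embeddings, vocab, k=5):
        """Return the k vocabulary entries whose embeddings are closest
        (by cosine similarity) to the query embedding."""
        q = query_vec / np.linalg.norm(query_vec)
        v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
        sims = v @ q
        top = np.argsort(-sims)[:k]
        return [(vocab[i], float(sims[i])) for i in top]

    # Toy usage with random stand-ins for a real vocabulary and embedding table.
    vocab = [f"word_{i}" for i in range(1000)]
    table = np.random.randn(1000, 256)
    print(nearest_words(table[42], table, vocab))   # word_42 should come out on top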
| |
| ▲ | hangonhn 5 days ago | parent [-] | | Can you do that in the middle of the layers? And if you do, would that word be that meaningful to the final output? Genuinely curious. | | |
| ▲ | mbowcut2 5 days ago | parent [-] | | You can, and there has been some interesting work done with it. The technique is called LogitLens: basically, you pass intermediate embeddings through the LM head to get logits corresponding to tokens. In this paper they use it to investigate whether LLMs have a language bias, i.e. does GPT "think" in English? https://arxiv.org/pdf/2408.10811 One problem with this technique is that the model wasn't trained with intermediate layers being mapped to logits in the first place, so it's not clear why the LM head should be able to map them to anything sensible. But alas, like everything in DL research, they threw science at the wall and a bit of it stuck. |
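A minimal logit-lens sketch with GPT-2 via Hugging Face transformers, for the curious; the layer index and prompt are arbitrary, and serious logit-lens work usually adds more careful normalization.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    inputs = tok("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # Take an *intermediate* layer's hidden state for the last token and push it
    # through the final layer norm + LM head, as if it were the last layer.
    layer = 6
    hidden = out.hidden_states[layer][:, -1]
    logits = model.lm_head(model.transformer.ln_f(hidden))
    print(tok.decode(logits.argmax(-1)))   # what layer 6 "thinks" the next token is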
|
|