Very much agree re: inscrutability. It gets even more complicated when you add the LLM-specific concept of rotary positional embeddings to the mix. In my experience, it's been exceptionally hard to communicate that concept to even technical folks that may understand (at a high level) the concept of semantic similarity via something like cosine distance.

I've come up with so many failed analogies at this point, I lost count (the concept of fast and slow clocks to represent the positional index / angular rotation has been the closest I've come so far).

▲

krackers 4 days ago | parent [-]

I've read that "No Position Embedding" seems to be better for long-term context anyway, so it's probably not something essential to explain.

▲

spmurrayzzz 4 days ago | parent [-]

Do you have a citation for the paper on that? IME, that's not really something you see used in practice, at least not after 2022 or so. Without some form positional adjustment, transformer-based LLMs have no way to differentiate from "The dog bit the man." and "The man bit the dog." given the token ids are nearly identical. You just end up back in the bag-of-words problem space. The self-attention mechanism is permutation-invariant, so as long as it remains true that the attention scores are computed as an unordered set, you need some way to model the sequence.

Long context is almost always some form of RoPE in practice (often YaRN these days). We can't confirm this with the closed-source frontier models, but given that all the long context models in the open weight domain are absolutely encoding positional data, coupled with the fact that the majority of recent and past literature corroborates its use, we can be reasonably sure they're using some form of it there as well.

EDIT: there is a recent paper that addresses the sequence modeling problem in another way, but its somewhat orthogonal to the above as they're changing the tokenization method entirely https://arxiv.org/abs/2507.07955

▲

krackers 3 days ago | parent [-]

The paper showing that dropping positional encoding entirely is feasible is https://arxiv.org/pdf/2305.19466 . But I was misremembering as to its long context performance, Llama 4 does use NoPE but it's still interleaved with RoPE layers. Just an armchair commenter though, so I may well be wrong.

My intuition for NoPE was that the presence of the causal mask provides enough of a signal to implicitly distinguish token position. If you imagine the flow of information in the transformer network, tokens later on in the sequence "absorb" information from the hidden states of previous tokens, so in this sense you can imagine information flowing "down (depth) and to the right (token position)", and you could imagine the network learning a scheme to somehow use this property to encode position.

	▲	spmurrayzzz 3 days ago \| parent [-]
		Ah didn't realize you were referring to NoPE explicitly. And yea, the intuitions gained from that paper are pretty much what I alluded to above, you don't get away with never modeling the positional data, the question is how you model it effectively and from where do you derive that signal. NoPE never really took off more broadly in modern architecture implementations. We haven't seen anyone successfully reproduce the proposed solution to the long context problem presented in the paper (tuning the scaling factor in the attention softmax). There is a recent paper back in December[1] that talked about the idea of positional information arising from the similarity of nearby embeddings. Its again in that common research bucket of "never reproduced", but interesting. It does sound similar in spirit though to the NoPE idea you mentioned of the causal mask providing some amount of position signal. i.e. we don't necessarily need to adjust the embeddings explicitly for the same information to be learned (TBD on whether that proves out long term). This all goes back to my original comment though of communicating this idea to AI/ML neophytes being challenging. I don't think skipping the concept of positional information actually makes these systems easier to comprehend since its critically important to how we model language, but its also really complicated to explain in terms of implementation. [1] https://arxiv.org/abs/2501.00073