remexre 3 days ago
That's a really anthropomorphizing description; a more mechanical one might be: the attention mechanism that transformers use to find information in the context is, in its simplest form, O(n^2), because for each token position the model considers whether relevant information was produced at the position of every other token. To preserve performance when really long contexts are used, current-generation LLMs use various ways to consider fewer positions in the context; for example, they might only consider the 4096 positions "most likely" to matter (de-emphasizing large numbers of "subtle hints" that something isn't correct), or they might have some way of combining several tokens' worth of information into a single value (losing some fine detail).
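
To make that concrete, here's a toy NumPy sketch of the three ideas: dense O(n^2) attention, a top-k variant that only keeps each query's k highest-scoring positions, and a pooled variant that merges adjacent tokens into one slot. The function names and the tiny sizes are made up for illustration; real models implement these approximations very differently (and never materialize the full score matrix just to throw most of it away).

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def full_attention(q, k, v):
        # Dense attention: every query position scores every key position,
        # producing an (n, n) score matrix -- the O(n^2) cost described above.
        scores = q @ k.T / np.sqrt(q.shape[-1])   # (n, n)
        return softmax(scores, axis=-1) @ v       # (n, d)

    def topk_attention(q, k, v, top_k=4):
        # Sparse variant: each query keeps only its top_k highest-scoring
        # positions and ignores the rest, which is cheaper per query but can
        # drop the "subtle hints" mentioned above.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        n = scores.shape[-1]
        if top_k < n:
            # Threshold each row at its top_k-th largest score; everything
            # below that gets -inf so softmax assigns it zero weight.
            thresh = np.partition(scores, n - top_k, axis=-1)[:, n - top_k][:, None]
            scores = np.where(scores >= thresh, scores, -np.inf)
        return softmax(scores, axis=-1) @ v

    def pooled_attention(q, k, v, block=2):
        # Compression variant: average adjacent keys/values into one slot,
        # so each query attends over n/block positions (losing fine detail).
        n, d = k.shape
        m = n // block
        k_pooled = k[: m * block].reshape(m, block, d).mean(axis=1)
        v_pooled = v[: m * block].reshape(m, block, d).mean(axis=1)
        return full_attention(q, k_pooled, v_pooled)

    rng = np.random.default_rng(0)
    n, d = 8, 16
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    print(full_attention(q, k, v).shape)            # (8, 16)
    print(topk_attention(q, k, v, top_k=4).shape)   # (8, 16), but only 4 positions per query
    print(pooled_attention(q, k, v, block=2).shape) # (8, 16), attending over 4 pooled slots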