in-silico 9 days ago
This is really just semantics, but I wouldn't call attending to the KV cache "re-reading" the context. The model takes in the context, encodes it into a "memory" (the KV cache), and accesses that memory later. That doesn't change just because the KV cache grows with the context. I don't know what memory would look like other than an encode-retrieve loop. Relevant: Transformers are Multi-State RNNs - https://arxiv.org/abs/2401.06104
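A toy sketch of that encode-retrieve loop, assuming a single causal attention head in pure Python with made-up 2-d weights (the class and function names are hypothetical, not from any library). Each token is projected into a key/value pair once and appended to the cache (encode); each new query then attends over what's stored (retrieve). The comparison at the end recomputes every key/value from the raw prefix at each step, which gives the same outputs, so the cache is storing the encoding rather than re-reading anything. Viewing `keys`/`values` as a state that grows by one slot per token is roughly the framing of the linked paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    # Multiply matrix m (list of rows) by vector v.
    return [dot(row, v) for row in m]

def attend(q, keys, values):
    # Scaled dot-product attention for one query over the stored keys/values.
    d = len(q)
    scores = [dot(q, k) / math.sqrt(d) for k in keys]
    mx = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

class KVCacheMemory:
    """Hypothetical toy: one causal attention head whose state is the KV cache."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys, self.values = [], []

    def step(self, x):
        # Encode: project the new token once and append it to memory.
        self.keys.append(matvec(self.w_k, x))
        self.values.append(matvec(self.w_v, x))
        # Retrieve: query the stored memory; earlier tokens are never re-projected.
        return attend(matvec(self.w_q, x), self.keys, self.values)

def full_recompute(w_q, w_k, w_v, xs):
    # Baseline: rebuild every key/value from the raw prefix at each position.
    outs = []
    for t in range(len(xs)):
        keys = [matvec(w_k, x) for x in xs[: t + 1]]
        values = [matvec(w_v, x) for x in xs[: t + 1]]
        outs.append(attend(matvec(w_q, xs[t]), keys, values))
    return outs

# Made-up weights and token embeddings, purely for illustration.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.2], [0.1, 0.9]]
W_V = [[1.0, 1.0], [0.0, 2.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

mem = KVCacheMemory(W_Q, W_K, W_V)
cached = [mem.step(x) for x in tokens]
baseline = full_recompute(W_Q, W_K, W_V, tokens)
```

The cache trades memory that grows linearly with the context for not having to re-encode the prefix; in a real multi-layer model the same holds per layer because causal masking means earlier keys/values never depend on later tokens.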