▲ | simonw 6 days ago
I'm surprised and a little disappointed by the result concerning instructions at the top, because it's incompatible with prompt caching: I would much rather cache the part of the prompt that includes the long document and then swap out the user question at the end.
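A minimal sketch of the tension (no real provider API here, just comparing raw prompt strings): prefix caches reuse only the longest previously seen prefix, so putting the question last keeps the expensive document prefix identical across requests, while putting it first invalidates the cache immediately.

```python
# Sketch of why prefix caching favors "document first, question last".
# Providers cache KV state for the longest shared token *prefix*, so only
# the part of the prompt before the first changed character can be reused.

def cacheable_prefix(prev_prompt: str, new_prompt: str) -> int:
    """Length of the shared prefix, i.e. the portion a KV cache could reuse."""
    n = 0
    for a, b in zip(prev_prompt, new_prompt):
        if a != b:
            break
        n += 1
    return n

DOC = "<70k tokens of document>"  # stand-in for the long, expensive content

# Question at the end: the whole document prefix is shared across questions.
p1 = f"{DOC}\n\nQuestion: What is the main finding?"
p2 = f"{DOC}\n\nQuestion: Who are the authors?"
print(cacheable_prefix(p1, p2))  # large: everything up to the question is reusable

# Question at the top: the prompts diverge almost immediately.
q1 = f"Question: What is the main finding?\n\n{DOC}"
q2 = f"Question: Who are the authors?\n\n{DOC}"
print(cacheable_prefix(q1, q2))  # tiny: only "Question: Wh" matches
```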
▲ | mmoskal 6 days ago | parent | next [-]
The way I understand it: if the instructions are at the top, the KV entries computed for the content can be influenced by the instructions - the model can "focus" on what you're asking it to do and perform some computation while it's "reading" the content. Otherwise, you're relying entirely on attention to find the information in the content afterwards, leaving the model much less token space to "think". A toy illustration of the asymmetry follows.
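Here's a made-up single causal-attention head (not any real model) demonstrating the point: perturbing an early "instruction" token changes the computation at every later position, so the document's KV entries at deeper layers can encode instruction-aware work; perturbing a late "question" token leaves everything before it untouched.

```python
import numpy as np

# Toy single-head causal attention. A position's output (and hence its
# deeper-layer K/V entries) depends on every token before it and nothing after.

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def causal_attention(x: np.ndarray) -> np.ndarray:
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(len(x), k=1)] = -np.inf  # causal mask: no future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.standard_normal((6, d))           # 6 token embeddings
x2 = x.copy()
x2[0] += 1.0                              # perturb the first ("instruction") token
out, out2 = causal_attention(x), causal_attention(x2)
print(np.allclose(out[1:], out2[1:]))     # False: every later position shifts

x3 = x.copy()
x3[-1] += 1.0                             # perturb the last ("question") token
out3 = causal_attention(x3)
print(np.allclose(out[:-1], out3[:-1]))   # True: earlier positions are untouched
```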
▲ | zaptrem 6 days ago | parent | prev | next [-]
Prompt on bottom is also easier for humans to read, as I can have my actual question and the model's answer on screen at the same time instead of scrolling through 70k tokens of context between them.
▲ | jeeeb 6 days ago | parent | prev | next [-]
Wouldn’t it be the other way around? If the instructions are at the top, the KV cache entries can be precomputed and cached. If they’re at the bottom, the entries at the lower layers will have a dependency on the user input.
▲ | swyx 6 days ago | parent | prev [-]
yep. we address it in the podcast. presumably this is just a recent discovery and can be post-trained away.