simonw 6 days ago

I'm surprised and a little disappointed by the result concerning instructions at the top, because it's incompatible with prompt caching: I would much rather cache the part of the prompt that includes the long document and then swap out the user question at the end.
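
A minimal sketch of the ordering I'd prefer, assuming Anthropic-style prompt caching (the model name, `long_document`, and the question are illustrative): everything up to the cache_control marker is reused across requests, so only the short question at the end changes.

    import anthropic

    client = anthropic.Anthropic()
    long_document = open("report.txt").read()  # big, stable context

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=1024,
        system=[
            {"type": "text", "text": "Answer questions about the document."},
            # Everything up to and including this block can be cached...
            {"type": "text", "text": long_document,
             "cache_control": {"type": "ephemeral"}},
        ],
        # ...so only the short query below has to be prefilled per request.
        messages=[{"role": "user",
                   "content": "What does the document conclude?"}],
    )
    print(response.content[0].text)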

mmoskal 6 days ago | parent | next [-]

The way I understand it: if the instructions are at the top, the KV entries computed for the content can be influenced by the instructions - the model can "focus" on what you're asking it to do and perform some computation while it's "reading" the content. Otherwise you're relying entirely on attention to find the information in the content at answer time, leaving the model much less token space to "think".
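
A toy way to see this (a sketch, not the real architecture): with a causal mask, each token's hidden state, and hence its KV entries at higher layers, can only incorporate tokens that come before it, so the document's representations are instruction-aware only when the instructions come first.

    import torch

    torch.manual_seed(0)
    d = 16
    instr = torch.randn(4, d)  # 4 "instruction" tokens
    doc = torch.randn(8, d)    # 8 "document" tokens

    def causal_self_attention(x):
        n = x.size(0)
        scores = x @ x.T / d ** 0.5
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ x

    # Instructions first: document tokens can attend to the instructions.
    doc_repr_first = causal_self_attention(torch.cat([instr, doc]))[4:]
    # Instructions last: document tokens never see them at all.
    doc_repr_last = causal_self_attention(torch.cat([doc, instr]))[:8]

    print(torch.allclose(doc_repr_first, doc_repr_last))  # False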

zaptrem 6 days ago | parent | prev | next [-]

Prompt on bottom is also easier for humans to read as I can have my actual question and the model’s answer on screen at the same time instead of scrolling through 70k tokens of context between them.

jeeeb 6 days ago | parent | prev | next [-]

Wouldn’t it be the other way around?

If the instructions are at the top, the KV cache entries can be precomputed and cached.

If they’re at the bottom, the entries at the lower layers will depend on the user input.

a2128 6 days ago | parent [-]

It's placing the instructions AND the user query at both the top and the bottom. So if you have a prompt like this:

    [Long system instructions - 200 tokens]
    [Very long document for reference - 5000 tokens]
    [User query - 32 tokens]
The key-values for the first 5200 tokens can be cached, so it's efficient to swap out the user query for a different one: you only need to prefill 32 tokens and generate output.

But the recommendation is the layout below, where you can only cache the first 200 tokens and have to prefill 5264 tokens every time the user submits a new query (see the sketch after the layout).

    [Long system instructions - 200 tokens]
    [User query - 32 tokens]
    [Very long document for reference - 5000 tokens]
    [Long system instructions - 200 tokens]
    [User query - 32 tokens]
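
A quick way to sanity-check those numbers, as a sketch of how prefix caching behaves (real implementations reuse KV entries for the longest prompt prefix shared with a previous request; everything after the first differing token must be prefilled again):

    # Token counts from the example above, with two queries q1 and q2.
    def tokens_to_prefill(cached_prompt, new_prompt):
        """Tokens that miss the cache: everything after the shared prefix."""
        shared = 0
        for a, b in zip(cached_prompt, new_prompt):
            if a != b:
                break
            shared += 1
        return len(new_prompt) - shared

    system, doc = ["sys"] * 200, ["doc"] * 5000
    q1, q2 = ["q1"] * 32, ["q2"] * 32

    # Query at the end: only the 32 query tokens miss the cache.
    print(tokens_to_prefill(system + doc + q1, system + doc + q2))  # 32

    # Instructions and query at top AND bottom: the prompts diverge at
    # token 200, so the remaining 5264 tokens must be prefilled.
    print(tokens_to_prefill(system + q1 + doc + system + q1,
                            system + q2 + doc + system + q2))  # 5264
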
jeeeb 6 days ago | parent [-]

Ahh I see. Thank you for the explanation. I didn’t realise there was user input straight after the system prompt.

swyx 6 days ago | parent | prev [-]

yep. we address it in the podcast. presumably this is just a recent discovery and can be post-trained away.

aoeusnth1 6 days ago | parent [-]

If you're skimming a text to answer a specific question, you can go a lot faster than if you have to memorize the text well enough to answer an unknown question after the fact.