danielbln 5 hours ago:
Do keep in mind that 1 large prompt every 5 minutes is not how, e.g., coding agents are used. There it's 1 large prompt every couple of seconds.
keeda 4 hours ago (in reply):
True, but I think in these scenarios they rely on prompt caching, which is much cheaper: https://ngrok.com/blog/prompt-caching/

I have no expertise here, but a couple of years ago I had a prototype using a locally deployed Llama 2 that cached the context from previous inference calls (a feature that has since been deprecated: https://github.com/ollama/ollama/issues/10576) and reused it for subsequent calls. The subsequent calls were much, much faster. I suspect prompt caching works similarly, especially since the changed code is very small compared to the rest of the codebase.
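For what it's worth, here is a minimal sketch of the mechanism as I understand it: run the big, stable prefix through the model once, keep its key/value states, and only process the small new suffix on later calls. This uses Hugging Face transformers purely for illustration; the model name and prompt strings are placeholders, and real serving stacks (Ollama, hosted APIs) manage this caching server-side rather than in user code.

```python
# Sketch of prefix KV-cache reuse, the mechanism usually called "prompt caching".
# Assumes Hugging Face transformers + torch; "gpt2" is a stand-in for any causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Large, stable prefix (e.g. system prompt + codebase context) -- encoded once.
prefix_ids = tok("...big shared context...", return_tensors="pt").input_ids

with torch.no_grad():
    # One full forward pass over the prefix; keep its key/value states.
    prefix_out = model(prefix_ids, use_cache=True)
    cached_kv = prefix_out.past_key_values  # this is the "prompt cache"

# A later request only appends a small suffix (e.g. the changed code / new question).
suffix_ids = tok(" what does this diff change?", return_tensors="pt").input_ids

with torch.no_grad():
    # Only the suffix tokens are processed; the prefix KV states are reused,
    # which is why cache hits are much faster (and billed cheaper by providers).
    out = model(suffix_ids, past_key_values=cached_kv, use_cache=True)
```

The key point is that the expensive part, attention over the long unchanged prefix, is paid once, so follow-up calls that only append a little new text are cheap.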