danielbln 5 hours ago:
Do keep in mind that 1 large prompt every 5 minutes is not how, e.g., coding agents are used. There it's 1 large prompt every couple of seconds.
keeda 4 hours ago (in reply):
True, but I think in these scenarios they rely on prompt caching, which is much cheaper: https://ngrok.com/blog/prompt-caching/

I have no expertise here, but a couple of years ago I had a prototype using a locally deployed Llama 2 that cached the context from previous inference calls (a feature that has since been deprecated: https://github.com/ollama/ollama/issues/10576) and reused it for subsequent calls. The subsequent calls were much, much faster. I suspect prompt caching works similarly, especially since the changed code is very small compared to the rest of the codebase.
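For what it's worth, here is a minimal sketch of the mechanism as I understand it: run the big, stable prefix through the model once, keep its key/value states, and only process the small new suffix on later calls. This uses Hugging Face transformers purely for illustration; the model name and prompt strings are placeholders, and real serving stacks (Ollama, hosted APIs) manage this caching server-side rather than in user code.

```python
# Sketch of prefix KV-cache reuse, the mechanism usually called "prompt caching".
# Assumes Hugging Face transformers + torch; "gpt2" is a stand-in for any causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Large, stable prefix (e.g. system prompt + codebase context) -- encoded once.
prefix_ids = tok("...big shared context...", return_tensors="pt").input_ids

with torch.no_grad():
    # One full forward pass over the prefix; keep its key/value states.
    prefix_out = model(prefix_ids, use_cache=True)
    cached_kv = prefix_out.past_key_values  # this is the "prompt cache"

# A later request only appends a small suffix (e.g. the changed code / new question).
suffix_ids = tok(" what does this diff change?", return_tensors="pt").input_ids

with torch.no_grad():
    # Only the suffix tokens are processed; the prefix KV states are reused,
    # which is why cache hits are much faster (and billed cheaper by providers).
    out = model(suffix_ids, past_key_values=cached_kv, use_cache=True)
```

The key point is that the expensive part, attention over the long unchanged prefix, is paid once, so follow-up calls that only append a little new text are cheap.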