red2awn 2 days ago
Correct, it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for low-latency responses you have to do streaming KV-cache prefill behind a websocket server.
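A minimal sketch of what "streaming KV-cache prefill" means here: tokens arriving over the websocket are prefilled into the cache as they come in, so when the user's turn ends only the last chunk needs a forward pass before decoding can start. The model and cache below are simulated stand-ins; all names are illustrative, not any real framework's API.

```python
# Streaming prefill sketch: prefill incrementally per websocket message
# instead of recomputing the whole prompt at end of turn.

class StreamingPrefill:
    def __init__(self):
        # One entry per prefilled token; stands in for real attention K/V tensors.
        self.kv_cache = []

    def on_chunk(self, tokens):
        # Called per websocket message while input is still streaming in:
        # run prefill only on the new tokens, extending the cache.
        for t in tokens:
            self.kv_cache.append(("kv", t))  # placeholder for a K/V pair

    def start_decode(self, final_tokens):
        # End of user turn: prefill just the final chunk, then decoding
        # can begin immediately against the already-built cache.
        self.on_chunk(final_tokens)
        return len(self.kv_cache)


session = StreamingPrefill()
session.on_chunk([1, 2, 3])    # arrives while the user is still speaking
session.on_chunk([4, 5])
n = session.start_decode([6])  # only one token left to prefill at turn end
print(n)  # 6
```

The latency win is that the expensive prefill work overlaps with the user's input instead of happening all at once after it.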
whimsicalism 2 days ago | parent
I imagine you have to start decoding many speculative completions in parallel to get truly low latency.
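One common form of this is speculative decoding: a cheap draft model proposes several tokens at once, and the expensive target model verifies them in a single pass, keeping the longest agreeing prefix. The toy sketch below uses trivial deterministic stand-ins for both models (with one injected draft mistake) purely to show the accept/correct loop; it is not any framework's actual implementation.

```python
# Toy speculative decoding loop: draft proposes k tokens, target verifies.

def draft_propose(ctx, k):
    # Hypothetical cheap draft model: guesses the next k tokens.
    guesses = [(ctx[-1] + i + 1) % 10 for i in range(k)]
    if k >= 3:
        guesses[2] = 0  # simulate a draft mistake on the third token
    return guesses

def target_next(ctx):
    # Hypothetical expensive target model: the ground-truth next token.
    return (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    proposal = draft_propose(ctx, k)
    accepted = []
    for tok in proposal:
        if tok == target_next(ctx + accepted):
            accepted.append(tok)  # draft agreed with target: keep it
        else:
            # Draft diverged: take the target's token and stop this round.
            accepted.append(target_next(ctx + accepted))
            break
    return ctx + accepted


result = speculative_step([3])
print(result)  # [3, 4, 5, 6] -- two draft tokens accepted, third corrected
```

Each round costs one target forward pass but can emit several tokens, which is where the latency reduction comes from.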