maho | 6 hours ago
The author only compared output token costs, but for typical agentic workloads, input tokens dominate the cost by a large margin. When you run inference locally, input tokens are, to first order, free: they only generate implicit costs through higher time-to-first-token, higher power use, and lower output token speed.
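To make that concrete, here is a rough back-of-envelope sketch; the per-token prices and session token counts below are made-up assumptions, not quotes from any provider:

    # Rough back-of-envelope: input vs. output cost for a hypothetical
    # agentic session. All numbers here are assumptions, not real quotes.
    input_tokens = 400_000         # assumption: resent context, tool output, file contents
    output_tokens = 20_000         # assumption: model replies and tool calls
    price_in = 3.00 / 1_000_000    # assumed $/input token
    price_out = 15.00 / 1_000_000  # assumed $/output token

    print(f"input cost:  ${input_tokens * price_in:.2f}")    # $1.20
    print(f"output cost: ${output_tokens * price_out:.2f}")  # $0.30
    # Input dominates whenever the input:output token ratio (20:1 here)
    # exceeds the output:input price ratio (5:1 here).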
amluto | 2 hours ago
Even ignoring the superior caching of a local setup, Mac hardware can often process input tokens around 10x as quickly as it produces output tokens. OpenRouter seems to show only a 2x difference on the same models.
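A small sketch of why that prefill:decode ratio matters for wall-clock time; the speeds and token counts are illustrative assumptions, not measurements of Mac hardware or OpenRouter:

    # Why the prefill:decode speed ratio matters for wall-clock time.
    # Speeds and token counts are illustrative assumptions only.
    prompt_tokens = 50_000
    reply_tokens = 1_000
    decode_tps = 30  # assumed decode speed (tokens/s), held equal to isolate the ratio

    for label, ratio in [("~10x prefill", 10), ("~2x prefill", 2)]:
        prefill_s = prompt_tokens / (decode_tps * ratio)
        decode_s = reply_tokens / decode_tps
        print(f"{label}: {prefill_s + decode_s:.0f}s total ({prefill_s:.0f}s waiting on prefill)")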
Wilya | 4 hours ago
Yeah, that completely invalidates the author's point. I looked at a couple of random agentic sessions in my OpenRouter activity, and the input cost is 10x the output cost. Prompt caching on OpenRouter is complicated and unreliable; on local hardware with llama-cpp, it's mostly free.
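A sketch of how the cache hit rate changes effective input cost over a multi-turn session; the hit rates, cache discount, and price are assumptions for illustration:

    # How a cache hit rate changes effective input cost over a multi-turn
    # session. Hit rates, discount, and price are assumptions for illustration.
    turns = 20
    context_tokens = 30_000      # assumption: context resent on every turn
    price_in = 3.00 / 1_000_000  # assumed $/input token
    cached_discount = 0.10       # assumption: cached input billed at 10% of full price

    def session_cost(hit_rate: float) -> float:
        cached = context_tokens * hit_rate
        fresh = context_tokens - cached
        return turns * (fresh * price_in + cached * price_in * cached_discount)

    print(f"no caching:            ${session_cost(0.0):.2f}")  # $1.80
    print(f"unreliable cache, 50%: ${session_cost(0.5):.2f}")  # $0.99
    # Local llama-cpp reuses the KV cache in memory, so the marginal
    # input cost of a cached prefix is ~0 on hardware you already own.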