kang 5 hours ago
You not only skipped the diligence but confused everyone by repeating what I said :( That is exactly what caching does: the LLM's inference state is reused. (The attention vectors are an internal artefact at this level of abstraction; effectively, at this level, they *are* the prompt.) The part of the prompt that has already been inferred no longer needs to be part of the input; it is replaced by the cached inference state. And none of this is tokens.
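A minimal sketch of the idea (a toy single-head attention in NumPy, not any particular library's API): with a cache, the key/value projections for the already-processed prefix are stored, so only the newly arriving token needs projecting, and the result matches recomputing everything from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # toy hidden size
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    # single-query attention over keys K and values V
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

prefix = rng.normal(size=(5, d))             # the "already inferred" part
new_tok = rng.normal(size=(1, d))            # newly arriving input

# Full recompute: project every position again.
full = np.vstack([prefix, new_tok])
out_full = attend(full[-1] @ Wq, full @ Wk, full @ Wv)

# Cached: K/V for the prefix were saved earlier; only the new token is projected.
K_cache, V_cache = prefix @ Wk, prefix @ Wv
K = np.vstack([K_cache, new_tok @ Wk])
V = np.vstack([V_cache, new_tok @ Wv])
out_cached = attend(new_tok[0] @ Wq, K, V)

assert np.allclose(out_full, out_cached)     # identical output, less recomputation
```

The cached K/V pairs here stand in for the "inference state" above: once they exist, the prefix itself never has to be fed through the model again.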