kang 5 hours ago
You not only skipped the diligence but confused everyone by repeating what I said :( That is exactly what caching does: the LLM's inference state is reused. (The attention vectors are an internal artefact at this level of abstraction; effectively, at this level, they *are* the prompt.) The part of the prompt that has already been inferred no longer needs to be part of the input; it is replaced by the cached inference state. And none of this is tokens.
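A minimal sketch of the idea (a toy single-head attention in NumPy, not any particular library's API): with a cache, the key/value projections for the already-processed prefix are stored, so only the newly arriving token needs projecting, and the result matches recomputing everything from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # toy hidden size
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    # single-query attention over keys K and values V
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

prefix = rng.normal(size=(5, d))             # the "already inferred" part
new_tok = rng.normal(size=(1, d))            # newly arriving input

# Full recompute: project every position again.
full = np.vstack([prefix, new_tok])
out_full = attend(full[-1] @ Wq, full @ Wk, full @ Wv)

# Cached: K/V for the prefix were saved earlier; only the new token is projected.
K_cache, V_cache = prefix @ Wk, prefix @ Wv
K = np.vstack([K_cache, new_tok @ Wk])
V = np.vstack([V_cache, new_tok @ Wv])
out_cached = attend(new_tok[0] @ Wq, K, V)

assert np.allclose(out_full, out_cached)     # identical output, less recomputation
```

The cached K/V pairs here stand in for the "inference state" above: once they exist, the prefix itself never has to be fed through the model again.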