TZubiri 2 hours ago

I'm not sure, but I think cached-read costs are not the most accurately priced. If you define your costs as what you pay when consuming an API endpoint, then sure, the answer is 50k tokens. But if you consider what it costs the provider, cached tokens probably carry a much higher margin than the (probably negative) margin on input and output inference tokens.

Most caching is done without hints from the application at this point, but some APIs are starting to accept hints or explicit controls for keeping the state associated with specific input tokens in memory, so these costs should come down. In essence, you don't really reprocess the input tokens at inference time. If you own the hardware, it's quite trivial to infer one output token at a time at no additional cost: if you have 50k input tokens and generate one output token, it's not like you have to "re-infer" the 50k input tokens before you output the second token.
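
Roughly what I mean, as a minimal sketch with an open-weights model and HuggingFace transformers (gpt2 is just a stand-in): the prompt is prefilled once, and every later decode step feeds only the newest token plus the cached keys/values, never the prompt again.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")             # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt_ids = tok("A long prompt that plays the role of 50k input tokens.",
                     return_tensors="pt").input_ids

    with torch.no_grad():
        # Prefill: run the prompt once and keep the per-layer key/value cache.
        out = model(prompt_ids, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

        # Decode: each step feeds only the single newest token plus the cached
        # state; the prompt is never pushed through the model again.
        for _ in range(20):
            out = model(next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)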

To put it in simple terms: the time it takes to generate the millionth output token is the same as for the first output token.

This is relevant to an application I'm working on where I check the logprobs and don't always choose the most likely token (for example, by implementing a custom logit_bias mechanism client-side), so I infer one output token at a time. That's not quite possible with most APIs, but if you control the hardware and use (virtually) zero-cost cached tokens, you can do it.
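
As a sketch of that loop, assuming an OpenAI-compatible /v1/completions endpoint (e.g. a local vLLM or llama.cpp server) that returns top_logprobs and caches the shared prompt prefix between calls; the base_url, model name and bias table are placeholders, not a real configuration:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    LOGIT_BIAS = {" the": -2.0}   # hypothetical client-side bias, in logprob units

    def pick_token(top_logprobs: dict) -> str:
        # Apply the bias client-side, then take the best adjusted score.
        scored = {t: lp + LOGIT_BIAS.get(t, 0.0) for t, lp in top_logprobs.items()}
        return max(scored, key=scored.get)

    text = "A long shared prompt prefix goes here..."
    for _ in range(50):                        # one API call per output token
        resp = client.completions.create(
            model="my-model",                  # placeholder
            prompt=text,
            max_tokens=1,
            logprobs=5,                        # return the top-5 alternatives
            temperature=0.0,
        )
        text += pick_token(resp.choices[0].logprobs.top_logprobs[0])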

So, bottom line: cached input tokens are naturally almost free (unless you hold them for a long period of time). The price of cached-input APIs is probably due to the lack of negotiation over which inputs you want cached. As APIs and self-hosted solutions evolve, we will likely see the cost of cached inputs drop massively, down to almost zero. With efficient application programming, the only accounting should be for output tokens and system prompts; your output tokens shouldn't be charged again as inputs, at least not more than once.

NitpickLawyer an hour ago | parent | next [-]

While some efficiencies could be gained from better client-server negotiation, the cost will never be 0. It isn't 0 even in "lab conditions", so it can't be 0 at scale. There are a few misconceptions in your post.

> the time it takes to generate the millionth output token is the same as for the first output token.

This is not true, even if you have the KV cache "hot" in VRAM. That's just not how transformers work: each new token still attends over every previous position in the cache, so the per-token cost grows with context length.
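
A back-of-the-envelope sketch of the scaling, using illustrative 7B-class shapes (32 layers, hidden size 4096, numbers are not from any specific model): even with the KV cache hot, decode step t still attends over all t cached positions.

    n_layers, d_model = 32, 4096

    def attn_flops_per_step(ctx_len: int) -> float:
        # Two matmuls per layer per step (QK^T and attn*V), each ~ctx_len * d_model
        # multiply-adds counted as 2 FLOPs. Ignores the MLP, which is constant per
        # step regardless of context length.
        return 2 * 2 * n_layers * ctx_len * d_model

    for ctx in (1, 1_000, 100_000, 1_000_000):
        print(f"context {ctx:>9,}: ~{attn_flops_per_step(ctx):.1e} attention FLOPs for the next token")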

> cached input tokens are almost virtually free naturally

No, they are not in practice. There are pure engineering considerations here: how you route requests, when you evict the KV cache, where you evict it to (RAM/NVMe), how long you keep it, and so on. At the scale of OpenAI/Google/Anthropic these are not easy tasks, and the cost is definitely not 0.

Think about a normal session. A user might prompt something, wait for the result, re-prompt (hitting the "hot" cache), and then go for a coffee. They come back 5 minutes later. You can't keep their state in "hot" cache that whole time, so now you have to route the next message in that thread to a) a place that has free "slots"; b) a place that can load the KV cache from "cold" storage; and c) a place that has enough room to handle a possible max-context request. These are not easy things to do in practice, at scale.

Now consider 100k users doing basically this, all day long. This is not free and can't become free.
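
As a toy sketch of just the bookkeeping side of this (all names, tiers and timeouts are made up; this is not how any provider actually does it): a cache-aware router that prefers the worker already holding a session's KV cache and spills idle caches out of VRAM after a timeout.

    import time
    from dataclasses import dataclass, field

    HOT_TTL_S = 120   # made-up idle timeout before a KV cache is spilled out of VRAM

    @dataclass
    class Worker:
        name: str
        vram_sessions: dict[str, float] = field(default_factory=dict)  # session -> last used
        cold_sessions: set[str] = field(default_factory=set)           # spilled to RAM/NVMe

        def evict_idle(self) -> None:
            now = time.time()
            for sid, last in list(self.vram_sessions.items()):
                if now - last > HOT_TTL_S:
                    del self.vram_sessions[sid]
                    self.cold_sessions.add(sid)   # pretend we wrote it to cold storage

    def route(session_id: str, workers: list[Worker]) -> Worker:
        for w in workers:
            w.evict_idle()
        # Preference order: hot hit, cold hit (cache must be reloaded), least-loaded fallback.
        chosen = (next((w for w in workers if session_id in w.vram_sessions), None)
                  or next((w for w in workers if session_id in w.cold_sessions), None)
                  or min(workers, key=lambda w: len(w.vram_sessions)))
        chosen.cold_sessions.discard(session_id)
        chosen.vram_sessions[session_id] = time.time()   # hot again on this worker
        return chosen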

mike_hearn 18 minutes ago | parent | prev | next [-]

GPU VRAM has an opportunity cost, so caching is never free. If that RAM is being used to hold KV caches in the hope that they'll be useful in future, but you lose that bet and you never hit that cache, you lost money that could have been used for other purposes.
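
For a rough sense of scale, with illustrative shapes for a 7B-class model using grouped-query attention (32 layers, 8 KV heads of dim 128, fp16), not any specific deployment:

    n_layers, n_kv_heads, head_dim, bytes_per = 32, 8, 128, 2   # fp16 K and V

    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per   # keys + values
    cache_gb = 50_000 * bytes_per_token / 1e9
    print(f"{bytes_per_token // 1024} KiB of VRAM per cached token")
    print(f"~{cache_gb:.1f} GB to keep a 50k-token prefix resident while the user gets coffee")

That's several gigabytes of VRAM per parked session that could otherwise be serving live requests.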

2001zhaozhao an hour ago | parent | prev | next [-]

Even if caching itself were free, I think making it cost nothing at the API level is not a great idea, considering that LLM attention currently gets more expensive the more tokens there are in context.

Making caching free would price "100,000-token cache, 1,000 read, 1,000 write" the same as "0-token cache, 1,000 read, 1,000 write", even though the first one costs more compute to run. I might be wrong about the scale of the effect here, though.

eshaham78 2 hours ago | parent | prev [-]

This matches my experience running coding agents at scale. The cached-token pricing is indeed somewhat artificial: in practice, for agent workflows with repeated context (like reading the same codebase across multiple tasks), you can achieve near-zero input costs through strategic caching. The real cost optimization isn't just token pricing but minimizing the total tokens flowing through the loop via better tool design.
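
Concretely, the pattern is to keep the big, stable context (system prompt plus codebase summary) as an identical prefix on every call and mark it cacheable, so only the small task-specific suffix is billed at the full input rate. A sketch using Anthropic-style cache_control blocks; the model name and the repo_summary.txt file are placeholders:

    import anthropic

    client = anthropic.Anthropic()
    CODEBASE_CONTEXT = open("repo_summary.txt").read()   # reused across many tasks

    def run_task(task: str) -> str:
        resp = client.messages.create(
            model="claude-sonnet-4-5",     # placeholder model name
            max_tokens=1024,
            system=[
                {"type": "text", "text": "You are a coding agent."},
                {   # large, stable block: written to the cache once, read cheaply after
                    "type": "text",
                    "text": CODEBASE_CONTEXT,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
            messages=[{"role": "user", "content": task}],   # small, changing suffix
        )
        return resp.content[0].text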

2001zhaozhao an hour ago | parent [-]

Are you hosting your own infrastructure for the coding agents? At least at first glance, sharing actual codebase context across compactions / multiple tasks seems pretty hard to pull off with a good cost-benefit ratio unless you have vertical integration from the inference layer all the way up to the coding-agent harness.

I'm saying this because current external LLM providers like OpenAI tend to charge quite a bit for longer-term caching, plus the 0.1x cache-read cost multiplied by the number of LLM calls. So I doubt context sharing would actually be that beneficial: you won't need all of the repeated context every time, so caching it means a longer context for each agentic task, which might increase API costs by more overall than you save through caching.
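
A rough break-even sketch of that trade-off, with made-up but typical-looking multipliers (cache writes at 1.25x the base input rate, reads at 0.1x), where needed_fraction stands for how much of the shared context a call would otherwise have had to carry on its own:

    BASE = 1.0                 # relative price of an uncached input token
    WRITE, READ = 1.25, 0.10   # assumed cache-write premium and cache-read discount

    def cost_with_sharing(shared_tokens: int, calls: int) -> float:
        # Pay the write premium once, then re-read the whole shared prefix on every call.
        return shared_tokens * WRITE + shared_tokens * READ * calls

    def cost_without_sharing(shared_tokens: int, needed_fraction: float, calls: int) -> float:
        # No shared cache: each call re-sends only the slice of context it actually needs.
        return shared_tokens * needed_fraction * BASE * calls

    for frac in (1.0, 0.3, 0.1):
        shared = cost_with_sharing(100_000, calls=20)
        unshared = cost_without_sharing(100_000, frac, calls=20)
        print(f"need {frac:.0%} of the context per call: "
              f"shared {shared:,.0f} vs unshared {unshared:,.0f} (relative units)")

With these made-up numbers, sharing the full context only pays off when most of it is actually needed on most calls, which is roughly my point above.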