dbeardsl 6 hours ago

I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs or reduce quality. Both are surprising and disappointing.

I feel like that is a choice best left up to users.

i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"

kiratp 17 minutes ago | parent | next [-]

By caching they mean “cached in GPU memory”. That’s a very very scarce resource.

Caching to RAM and disk is a thing, but it's hard to maintain performance that way, and it's early days for that tech being deployed anywhere.

Disclosure: I work on AI at Microsoft. The above is just common industry info (see the work happening in vLLM, for example).

giwook 4 hours ago | parent | prev | next [-]

Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).

Perhaps if we were willing to pay more for our subscriptions, Anthropic could offer longer cache windows. But one hour seems like a reasonable amount of time given the constraints, and it's a limitation I'm happy to work around (it's not that hard) in exchange for paying just $100 or $200 a month for the industry-leading LLM.

Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.

jimkleiber an hour ago | parent | next [-]

I might be willing to pay more, maybe a lot more, for a subscription above Claude Max 20x, but the only thing higher is pay-per-token, and I really don't like products that force me to be that minutely aware of my usage, especially when there's unpredictability to it. I think there's a reason most telecoms moved away from per-minute or especially per-MB charging. Even per-GB plans now usually offer X GB up front; I'm OK with that on a phone, but much less so on a computer, because of the unpredictability of a software update's size.

Kinda like when restaurants make me pay for ketchup or a takeaway box: I get annoyed. Just fold it into the overall price.

sharts 3 hours ago | parent | prev [-]

It doesn't make sense to pay more for cache warming. Your session, for the most part, is already persisted. Why would it be reasonable to pay again to continue where you left off at any point in the future?

jeremyjh 3 hours ago | parent | next [-]

Because it significantly increases actual costs for Anthropic.

If they ignored this then all users who don’t do this much would have to subsidize the people who do.

cadamsdotcom an hour ago | parent | prev [-]

Sure, it wouldn’t make sense if they only had one customer to serve :)

JumpCrisscross 6 hours ago | parent | prev | next [-]

> I was never under the impression that gaps in conversations would increase costs

The UI could indicate this by showing a timer before context is dumped.

karsinkk 6 hours ago | parent | next [-]

Yes!! A UI widget that shows how far along the prompt-cache eviction timeline we are would be great.

vyr 2 hours ago | parent | prev | next [-]

A countdown clock telling you that you should talk to the model again before your streak expires? That's the kind of UX I'd expect from an F2P mobile game or an abandoned-shopping-cart nag notification.

abustamam 2 hours ago | parent [-]

Well sure, if you put it that way, they're similar. But either you don't see it and get surprised by increased quota usage, or you do see it and know what it means. Bonus points if they let you turn it off.

No need to gamify it. It's just UI.

jimkleiber an hour ago | parent | prev [-]

I tried to hack the statusline to show this, but when I tried, I don't think the API gave that info. I'd love it if they let us access more variables in the statusline.
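For what it's worth, a rough local approximation is possible even without API support. This sketch assumes Claude Code's statusline command receives session JSON on stdin (including a `transcript_path` field), and it guesses at a 1-hour cache TTL based on the transcript file's mtime; both the field name and the TTL are assumptions here, not exposed cache state:

```python
#!/usr/bin/env python3
"""Approximate cache-expiry countdown for a Claude Code statusline.

The API doesn't expose cache state, so this guesses locally: it assumes a
1-hour prompt-cache TTL and measures idle time from the transcript's mtime.
"""
import json
import os
import sys
import time

CACHE_TTL_SECONDS = 3600  # assumed 1-hour cache window


def countdown(transcript_path, now=None):
    """Return a short status string estimating time left before cache eviction."""
    now = time.time() if now is None else now
    try:
        idle = now - os.path.getmtime(transcript_path)
    except OSError:
        return "cache: ?"  # transcript missing or unreadable
    left = CACHE_TTL_SECONDS - idle
    return "cache: cold" if left <= 0 else f"cache: ~{int(left // 60)}m left"


def main():
    # Claude Code invokes the statusline command with session JSON on stdin;
    # `transcript_path` is assumed to be among the fields.
    data = json.load(sys.stdin)
    print(countdown(data.get("transcript_path", "")))
```

Wiring it up would mean pointing the statusline config at this script; treat the details as a sketch to adapt against the actual statusline docs.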

computably 6 hours ago | parent | prev | next [-]

> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
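Back-of-envelope, the difference can be sketched like this (toy accounting, not Anthropic's actual pricing; `turn_len` is an assumed per-turn token count):

```python
def tokens_processed(turns, turn_len, cached):
    """Total input tokens the model must actually process over a conversation.

    Without a prefix cache, every request re-sends the whole history, so
    turn k costs k * turn_len tokens; with a warm cache, only the new turn
    needs processing.
    """
    if cached:
        return turns * turn_len                 # O(N): new tokens only
    return turn_len * turns * (turns + 1) // 2  # O(N^2): sum of 1..N turns

# A 50-turn session with ~500 fresh tokens per turn:
print(tokens_processed(50, 500, cached=False))  # → 637500
print(tokens_processed(50, 500, cached=True))   # → 25000
```

The gap between those two numbers is the work (and cost) that an evicted cache pushes back onto whoever resumes the conversation.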

doesnt_know 5 hours ago | parent | next [-]

How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

tempest_ an hour ago | parent [-]

I use CC, and I understand what caching means.

I have no idea how that works in an LLM implementation, nor do I actually know what they're caching in this context.

solarkraft 6 hours ago | parent | prev | next [-]

I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

mpyne 4 hours ago | parent [-]

> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

Does mmap(2) educate the developer on how disk I/O works?

At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, moving with it as it changes.

websap an hour ago | parent | next [-]

Does using print() in Python mean I need to understand the kernel? This is an absurd thought.

zem 3 hours ago | parent | prev [-]

mmap(2) and all its underlying machinery are open source and well documented besides.

mpyne 3 hours ago | parent [-]

There are open-source and even open-weight models that operate in exactly this way (as it's based off of years of public research), and even if there weren't the way that LLMs generate responses to inputs is superbly documented.

Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.

It's not at all hard to find documentation on this topic. It could be made more prominent in the UI, but that's true of lots of things, and hammering on "AI 101" topics would clutter the UI, crowding out the actual decision points the user may want to act on. You can't assume users know about those, but you should be able to assume they know how LLMs eat up tokens in the first place.

margalabargala 5 hours ago | parent | prev | next [-]

Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.

someguyiguess 6 hours ago | parent | prev | next [-]

Yes. It’s perfectly reasonable to expect the user to know the intricacies of the caching strategy of their llm. Totally reasonable expectation.

jghn 2 hours ago | parent | next [-]

To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session, I noticed that my next prompt would chew up a bunch of context. That led me to do some digging, at which point I discovered their prompt caching.

So while I'd agree with your sarcasm that expecting users to be experts in the system is a big ask, where I disagree is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.

abustamam 2 hours ago | parent [-]

> users should be curious and actively attempting to understand how it works

Have you ever talked with users?

> this is an endless job

Indeed. If we spend all our time learning what changed in our tooling every time it changes without proper documentation, we spend our working lives keeping up instead of doing our actual jobs.

Octoth0rpe an hour ago | parent [-]

There are general users of the average SaaS, and there are claude code users. There's no doubt in my mind that our expectations should be somewhat higher for CC users re: memory. I'm personally not completely convinced that cache eviction should be part of their thought process while using CC, but it's not _that_ much of a stretch.

coldtea 4 hours ago | parent | prev [-]

It's not like they have a poweful all-knowing oracle that can explain it to them at their dispos... oh, wait!

esafak 4 hours ago | parent [-]

They have to know that this could bite them and to ask the question first.

nixpulvis 4 hours ago | parent [-]

I do think having some insight into the current state of the cache and a realistic estimate for prompt token use is something we should demand.

exac 4 hours ago | parent | prev | next [-]

It is more useful to read posts and threads like this exact thread IMO. We can't know everything, and the currently addressed market for Claude Code is far from people who would even think about caching to begin with.

kovek 5 hours ago | parent | prev | next [-]

What if the cache was backed up to cold storage? Instead of having to recompute everything.

bontaq 4 hours ago | parent | prev | next [-]

How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?

jannyfer 3 hours ago | parent [-]

I’m not sure that it’s O(N) with caching but this illustrates the N^2 part:

https://blog.exe.dev/expensively-quadratic
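To answer the question directly with toy units (each turn adds one unit of context): a 3-turn conversation without caching processes 1 + 2 + 3 = 6 units of input, not 3x or 9x, and the total grows roughly as N²/2:

```python
def units_processed(turns, cached=False):
    """Toy count of context units processed when each turn adds one unit."""
    return turns if cached else sum(range(1, turns + 1))

print(units_processed(3))               # → 6: request k re-sends all k units
print(units_processed(3, cached=True))  # → 3: only the new unit each time
```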

kang 5 hours ago | parent | prev | next [-]

It seems you haven't done the due diligence on which part of the API is expensive: constructing a prompt shouldn't cost the same as an LLM pass.

coldtea 4 hours ago | parent [-]

It seems you haven't done the due diligence on what the parent meant :)

It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.

It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
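A toy sketch of what "reusing inference state" means, using a single attention head in plain Python (illustrative only; real serving stacks cache per-layer K/V tensors on the GPU):

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

class KVCache:
    """Toy KV cache: keep the key/value rows for the prefix already processed,
    so each new token appends one row instead of recomputing the whole prefix."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, k_new, v_new, q_new):
        # One new token: append its K/V row, then attend over the cached prefix.
        self.keys.append(k_new)
        self.values.append(v_new)
        return attend(q_new, self.keys, self.values)
```

Evicting the cache just drops `keys`/`values`; rebuilding them requires a full forward pass over every prefix token, which is why resuming a cold conversation is billed like re-sending the whole prompt.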

kang 4 hours ago | parent [-]

You not only skipped the diligence but confused everyone by repeating what I said :(

That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artifact; at this level of abstraction, the state effectively is the prompt.)

The part of the prompt that has already been inferred no longer needs to be part of the input; it is replaced by the cached inference state. And none of this is tokens.

raron 5 hours ago | parent | prev | next [-]

How big is this cached data? Wouldn't it be possible to download it after a few minutes of idling "to suspend the session", then upload and restore it when the user starts their next interaction?

throwdbaaway 4 hours ago | parent | next [-]

Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache

With this much cheaper setup backed by disks, they can offer a much better caching experience:

> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
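As a back-of-envelope check on that size (the model dimensions below are hypothetical placeholders, not Claude's real architecture):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """Bytes of KV cache: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Hypothetical large model: 80 layers, 8 KV heads (GQA), head_dim 128,
# a 50k-token session, fp16/bf16 entries:
gib = kv_cache_bytes(80, 8, 128, 50_000) / 2**30
print(f"{gib:.1f} GiB")  # → 15.3 GiB
```

So with grouped-query attention, a few tens of thousands of tokens of context already lands in the 10-20 GiB range, consistent with the estimate above; longer contexts or more KV heads push it higher.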

cyanydeez 4 hours ago | parent | prev [-]

I often see a local model, QWEN3.5-Coder-Next, grow to about 5 GB or so over the course of a session using llamacpp-server. I'd bet these trillion-parameter models are even worse. Even if you wanted to download or offload it, or offered that as a service, to start back up again you'd _still_ be paying the token cost, because all of that context _is_ the tokens you've just processed.

The cache is what makes your journey from a 1k prompt to a 1-million-token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.

miroljub 4 hours ago | parent | prev [-]

This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.

nixpulvis 4 hours ago | parent | prev | next [-]

How else would you implement it?

cyanydeez 4 hours ago | parent | prev [-]

It'd probably help power users, and transparency generally, to actually show how the cache is being used. If you run local models with llamacpp-server, you can watch the cache slots fill up with every turn; when subagents spawn, you see another process spin up and take a cache slot; and the model starts slowing down as the context grows (around 80-90k tokens on an AMD 395+), because the cache loads get bigger.

So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral, because keeping all that cache active is computationally expensive.

You're still just running text through an extremely complex process and adding to that text; to avoid recalculating the entire chain, you need the cache.