computably 6 hours ago

> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
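A toy sketch of the two cost curves (illustrative numbers only; real providers still bill cached tokens, just at a reduced rate):

    # Each turn appends 500 new tokens; the naive client resends the whole history.
    turn_tokens = [500, 500, 500, 500]

    naive_cost, cached_cost, history = 0, 0, 0
    for t in turn_tokens:
        history += t
        naive_cost += history   # full context re-processed every turn -> ~N^2 growth
        cached_cost += t        # cached prefix skipped, only new tokens at full price -> ~N growth

    print(naive_cost, cached_cost)  # 5000 vs 2000, and the gap widens with more turns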

doesnt_know 5 hours ago | parent | next [-]

How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

tempest_ an hour ago | parent [-]

I use CC, and I understand what caching means.

I have no idea how that works with an LLM implementation, nor do I actually know what they are caching in this context.

solarkraft 6 hours ago | parent | prev | next [-]

I somewhat disagree that this falls under due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

mpyne 4 hours ago | parent [-]

> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

Does mmap(2) educate the developer on how disk I/O works?

At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice and will shift along with it.

websap an hour ago | parent | next [-]

Does using print() in Python mean I need to understand the kernel? This is an absurd thought.

zem 3 hours ago | parent | prev [-]

mmap(2) and all its underlying machinery are open source and well documented besides.

mpyne 3 hours ago | parent [-]

There are open-source and even open-weight models that operate in exactly this way (it's all based on years of public research), and even if there weren't, the way that LLMs generate responses to inputs is superbly documented.

Seems like every month someone writes up a brilliant article on how to build an LLM from scratch, or similar, that hits the HN front page, usually with fancy animated blocks and everything.

It's not at all hard to find documentation on this topic. It could be made more prominent in the UI, but that's true of lots of things, and hammering on "AI 101" topics would clutter the UI meant for actual decision points the user may want to act on, the things you can't assume the user already knows about the way you (should) be able to assume they know how LLMs eat up tokens in the first place.

margalabargala 5 hours ago | parent | prev | next [-]

Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it; don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence, I would use Sonnet or Haiku.

someguyiguess 6 hours ago | parent | prev | next [-]

Yes. It's perfectly reasonable to expect the user to know the intricacies of the caching strategy of their LLM. Totally reasonable expectation.

jghn 2 hours ago | parent | next [-]

To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session I noticed that my next prompt would chew up a bunch of context. And that led me to do some digging, at which point I discovered their prompt caching.

So while I'd agree with your sarcasm that expecting users to be experts in the system is a big ask, where I disagree with you is that I think users should be curious and actively attempting to understand how it works around them. Granted, the tooling changes often, so this is an endless job.

abustamam 2 hours ago | parent [-]

> users should be curious and actively attempting to understand how it works

Have you ever talked with users?

> this is an endless job

Indeed. If we spend all our time learning what changed in all our tooling whenever it changes without proper documentation, then we spend our working lives keeping up instead of doing our actual jobs.

Octoth0rpe an hour ago | parent [-]

There are general users of the average SaaS, and there are Claude Code users. There's no doubt in my mind that our expectations should be somewhat higher for CC users re: memory. I'm personally not completely convinced that cache eviction should be part of their thought process while using CC, but it's not _that_ much of a stretch.

coldtea 4 hours ago | parent | prev [-]

It's not like they have a powerful all-knowing oracle that can explain it to them at their dispos... oh, wait!

esafak 4 hours ago | parent [-]

They have to know that this could bite them in order to ask the question in the first place.

nixpulvis 4 hours ago | parent [-]

I do think having some insight into the current state of the cache and a realistic estimate for prompt token use is something we should demand.

exac 4 hours ago | parent | prev | next [-]

It is more useful to read posts and threads like this exact one, IMO. We can't know everything, and the market Claude Code currently addresses is far from people who would even think about caching to begin with.

kovek 5 hours ago | parent | prev | next [-]

What if the cache were backed up to cold storage instead of having to recompute everything?

bontaq 4 hours ago | parent | prev | next [-]

How's that O(N^2)? How's it O(N) with caching? Does a 3-turn conversation cost 3 times as much with no caching, or 9 times as much?

jannyfer 3 hours ago | parent [-]

I'm not sure that it's O(N) with caching, but this illustrates the N^2 part:

https://blog.exe.dev/expensively-quadratic
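Back-of-envelope (assuming equal-sized turns and ignoring output tokens): with no caching, an N-turn conversation processes 1 + 2 + ... + N = N(N+1)/2 turns' worth of input, so a 3-turn chat does about 6 units of work rather than 3, and the ratio keeps growing with N. If cache reads were free, total work would stay roughly proportional to N.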

kang 5 hours ago | parent | prev | next [-]

It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't incur the same charge/cost as an LLM pass.

coldtea 4 hours ago | parent [-]

It seems you haven't done the due diligence on what the parent meant :)

It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.

It is about reusing LLM inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
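
A minimal single-head sketch of the idea in toy NumPy (not how any provider actually implements it): the cached key/value tensors for the unchanged prefix are kept, and only the new tokens get projected and appended.

    import numpy as np

    d = 8
    Wk, Wv = np.random.randn(d, d), np.random.randn(d, d)

    def extend_kv_cache(cache, new_token_embeddings):
        # Only the new tokens are projected; the prefix's K/V stay as-is.
        K_new = new_token_embeddings @ Wk
        V_new = new_token_embeddings @ Wv
        if cache is None:
            return K_new, V_new
        K_old, V_old = cache
        return np.vstack([K_old, K_new]), np.vstack([V_old, V_new])

    cache = extend_kv_cache(None, np.random.randn(1000, d))  # first request: 1000-token prompt
    cache = extend_kv_cache(cache, np.random.randn(20, d))   # follow-up: only 20 new tokens computed
    print(cache[0].shape)  # (1020, 8); the 1000-token prefix was never re-projected

Evict that cache and the next request has to rebuild all 1020 rows from scratch, which is the cost this thread is complaining about.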

kang 4 hours ago | parent [-]

You not only skipped the diligence but confused everyone by repeating what I said :(

That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artefact at this level of abstraction; effectively, at this level, it's the prompt.)

The part of the prompt that has already been inferred no longer needs to be part of the input; it gets replaced by the inference subset. And none of this is tokens.

raron 5 hours ago | parent | prev | next [-]

How big is this cached data? Wouldn't it be possible to download it after idling a few minutes to "suspend the session", then upload and restore it when the user starts their next interaction?

throwdbaaway 4 hours ago | parent | next [-]

Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache

With this much cheaper setup backed by disks, they can offer a much better caching experience:

> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
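
For a sense of where the 10~20 GiB figure comes from, a rough back-of-envelope; the model shape below is pure guesswork (Claude's architecture isn't public), just picked to land in that range:

    # KV cache size ≈ 2 (keys + values) * layers * kv_heads * head_dim * bytes * tokens
    layers, kv_heads, head_dim = 60, 4, 128   # assumed grouped-query-attention shape
    bytes_per_value = 2                       # fp16/bf16
    context_tokens = 128_000

    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    print(f"{per_token * context_tokens / 2**30:.1f} GiB")  # ~14.6 GiB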

cyanydeez 4 hours ago | parent | prev [-]

I often see a local model (QWEN3.5-Coder-Next) grow to about 5 GB or so over the course of a session with llamacpp-server. I'd bet these trillion-parameter models are even worse. Even if you wanted to download it, offload it, or have that offered as a service, you'd _still_ be paying the token cost to start back up again, because all of that context _is_ the tokens you've just done.

The cache is what makes your journey from a 1k prompt to a 1-million-token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.

miroljub 4 hours ago | parent | prev [-]

This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.