KV cache is becoming the memory hierarchy of inference (touchdown-labs.com)
43 points by matt_d 2 days ago | 7 comments
tptacek an hour ago
There's like an interesting systems article here, but at this point I'd rather they just gave me the prompt they used to generate it, so I can read it interactively in my own GPT5.5 session.
htk 2 hours ago
Hard to read article. The writing is curiously more robotic and repetitive than those written by AI.
tyleo 37 minutes ago
I generally don’t fill my context with enough stuff that this becomes a problem. I don’t think more data = better on the token side. Instead I’d be researching with focused prompts or subagents and surfacing only relevant context to a primary agent.
burakemir an hour ago
Take this with a grain of salt as I am new to this, but IMHO for establishing the memory hierarchy once and for all, it would be more helpful to present some abstract theory that covers:
* Prefill (time to first token, TTFT) vs. decode (time between tokens, TBT, aka 1/tps)
* The various ways to schedule the computation, and the roles of runtime vs. driver
* The scenarios and choices, taking into account traffic patterns and whether you are an inference service, doing batch, or claw whatnot.
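One way to make the prefill vs. decode split concrete is a small latency calculation. The sketch below assumes a simple model in which a request waits one TTFT for its first token and then one TBT for each further token; the function names and the example numbers are made up for illustration, not measurements.

```python
def request_latency(ttft_s: float, tbt_s: float, output_tokens: int) -> float:
    """End-to-end latency: one prefill wait, then one TBT gap per further token."""
    if output_tokens < 1:
        raise ValueError("need at least one output token")
    return ttft_s + (output_tokens - 1) * tbt_s


def decode_tps(tbt_s: float) -> float:
    """Decode throughput in tokens per second, i.e. 1 / TBT."""
    return 1.0 / tbt_s


# Hypothetical request: 0.5 s to first token, 20 ms between tokens, 101 output tokens.
total_s = request_latency(ttft_s=0.5, tbt_s=0.02, output_tokens=101)
print(f"total latency: {total_s:.2f} s, decode speed: {decode_tps(0.02):.0f} tok/s")
```

In this model a long prompt mostly moves TTFT, while a long answer mostly multiplies TBT, which is why the two phases are usually reported separately.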
cyanydeez an hour ago
OK, so for anyone who hasn't played with local models and watched what's going on with the KV cache:

1. You send your prompt, and nowadays whatever harness you're using sends a whole mess of context along with it: available skills, tools, guardrails, etc. The inference engine tokenizes all of it and runs it through the model on the GPU. This is the "prompt processing" (prefill) speed; per token it's the fastest portion of inference, but it is essentially "buffering" before any output appears. The per-token results (the keys and values) can be cached.
2. The inference engine then generates the next tokens, more slowly, one at a time; these, I think, are cached as well (tokens -> text).

Crucially: the KV cache lives in the GPU's own memory (VRAM), not in a separate caching layer in front of the model; moving it off the GPU would make it far slower, because it holds state for _all_ the tokens in a conversation. So, as with any cache, eviction has to occur to free up the VRAM that other requests need.

So if you had a conversation an hour ago, in the cloud, it's doubtful any of those cached tokens still exist; if you got up to 500k, you're going through step #1 again. If you're going turn by turn immediately, you can skip to #2.

So some of the reports in March about the whole token generation allowance suddenly disappearing within hours were likely a KV cache/billing issue: they were charging you as if you were generating all those tokens on every back-and-forth. Whether it was a bug in billing or a bug in programming, who knows.

The trouble is that the traditional web-server proxy caching and load-balancing tricks that helped scale the web don't work here! Your conversation with 100k of context has to return to the same cluster, maybe even the same GPU, to rely on the extraordinarily fast KV cache reuse.
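The eviction and routing points above can be sketched with a toy model. It assumes each conversation's context only grows by appending, that the cache tracks token counts rather than real key/value tensors, and that eviction is least-recently-used; `ToyKVCache`, `run_turn`, and `pick_worker` are hypothetical names, not any inference engine's API.

```python
import hashlib
from collections import OrderedDict


class ToyKVCache:
    """LRU store: conversation id -> number of tokens whose K/V entries are held."""

    def __init__(self, capacity_tokens: int):
        self.capacity_tokens = capacity_tokens
        self.entries = OrderedDict()  # oldest conversation first

    def used(self) -> int:
        return sum(self.entries.values())

    def run_turn(self, conv_id: str, context_len: int) -> int:
        """Process one turn; return how many tokens had to be prefilled."""
        cached = self.entries.pop(conv_id, 0)
        # Only tokens beyond the cached prefix need prompt processing.
        to_prefill = max(context_len - cached, 0)
        # Evict least recently used conversations until the new entry fits.
        while self.entries and self.used() + context_len > self.capacity_tokens:
            self.entries.popitem(last=False)
        if context_len <= self.capacity_tokens:
            self.entries[conv_id] = context_len
        return to_prefill


def pick_worker(conv_id: str, num_workers: int) -> int:
    """Sticky routing (an assumption): hash the conversation id to a worker index."""
    digest = hashlib.sha256(conv_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


cache = ToyKVCache(capacity_tokens=600_000)
first = cache.run_turn("chat-a", 500_000)   # cold start: whole context is prefilled
follow = cache.run_turn("chat-a", 500_200)  # immediate follow-up: only the new tokens
cache.run_turn("chat-b", 300_000)           # another user's turn pushes chat-a out
later = cache.run_turn("chat-a", 500_400)   # back after eviction: full prefill again
print(first, follow, later)
```

With these made-up sizes, the immediate follow-up only prefills the 200 new tokens, while the same conversation after eviction pays for the full context again. Hashing the conversation id is just one simple way to send a conversation back to the same worker; real deployments may route differently.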