hawtads a day ago

Okay, here's the tl;dr:

Attention-based neural network architectures (on which the majority of LLMs are built) have a unit economic cost that scales roughly as n^2, i.e. quadratically in the sequence length (for both memory and compute). In other words, the longer the context window, the more expensive it is for the upstream provider. That's one cost.
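
A toy sketch of why it's quadratic (illustrative only; the d_model value and constant factor are made up, not any real model's numbers):

    # Self-attention compares every token against every other token,
    # so the score matrix alone is n x n.
    def attention_flops_estimate(n_tokens: int, d_model: int = 4096) -> int:
        # QK^T scores: n * n * d multiply-adds, plus applying the weights to V: n * n * d
        return 2 * n_tokens * n_tokens * d_model

    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} tokens -> {attention_flops_estimate(n):.3e} attention FLOPs")
    # 10x more context -> roughly 100x more attention compute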

The second cost is that you have to resend the entire context every time you send a new message. So, where a, b, and c are messages: the first request sends a, the second sends a->b, the third sends a->b->c. From the developer's point of view the API is mostly stateless (there are some short-term caching mechanisms, YMMV by provider, which is why "cached" tokens, especially system prompts, are cheaper); the state, i.e. the context window string, is managed by the end-user application (the coding agent, the IDE, the ChatGPT UI client, etc.).
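
A minimal sketch of what the client resends on each turn (the API call here is a stand-in stub, not any specific SDK):

    # The client keeps the whole transcript and resends it on every turn.
    history = []

    def call_llm_api(messages):
        # stand-in for a real chat-completion call; returns a canned reply
        return f"(reply; {len(messages)} messages were sent)"

    def ask(user_msg: str) -> str:
        history.append({"role": "user", "content": user_msg})
        # Every prior message goes back over the wire; you pay for all of it again.
        reply = call_llm_api(messages=history)
        history.append({"role": "assistant", "content": reply})
        return reply

    ask("a")   # sends: a
    ask("b")   # sends: a, b
    ask("c")   # sends: a, b, c  -- input tokens grow every single turn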

The per-token price you pay is an amortized (averaged) cost of memory+compute; the provider's actual cost is mostly quadratic, so each marginal token is more expensive than the last. The longer the context window, the more expensive things get. Because of this, AI agent providers (especially those charging flat-fee subscription plans) are incentivized to keep costs low by limiting the maximum context window size.

(And if you think about it carefully, your AI API bill is a quadratic cost curve projected onto a straight line: a flat fee per token. So the model hosting provider may in some cases make more profit when users send short contexts than when they constantly saturate the window. YMMV of course, but it's a race to the bottom right now for LLM unit economics.)
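
Back-of-envelope illustration of that mismatch (the price and cost coefficient are completely made up, just to show the shape):

    PRICE_PER_TOKEN = 3e-6        # hypothetical flat price: $3 per 1M input tokens
    COST_COEFF      = 5e-12       # hypothetical coefficient on the provider's n^2 cost term

    def revenue(n): return n * PRICE_PER_TOKEN
    def cost(n):    return COST_COEFF * n * n   # toy quadratic cost model

    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9} tokens  revenue ${revenue(n):.4f}  est. cost ${cost(n):.4f}")
    # With made-up constants like these, short contexts have fat margins and a
    # fully saturated 1M window flips to a loss -- hence the incentive to cap context.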

One way they do this is by interrupting a task partway through and generating a "summary" of the progress so far, then prompting the LLM again with a fresh prompt plus that summary, and the LLM picks the task back up from where it left off. Text is of course a poor representation of the LLM's internal state, but it's the best option AI applications have so far to keep costs low.
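
A rough sketch of that compaction loop (the token cap, the token counting, and the summarizer prompt are all made up; real agents differ):

    MAX_CONTEXT_TOKENS = 100_000    # made-up cap, typically well below the model's real limit

    def count_tokens(messages):
        # crude stand-in for a real tokenizer: ~4 characters per token
        return sum(len(m["content"]) for m in messages) // 4

    def maybe_compact(history, llm):
        if count_tokens(history) < MAX_CONTEXT_TOKENS:
            return history
        # Ask the model to summarize everything so far, then restart with a fresh,
        # much shorter context seeded with that summary.
        summary = llm([{"role": "user",
                        "content": "Summarize the task progress so far:\n"
                                   + "\n".join(m["content"] for m in history)}])
        return [{"role": "system", "content": "Progress summary: " + summary}]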

Another thing to keep in mind is that LLM performance degrades as the input gets larger. This is due to a variety of factors (mostly, I think, because there isn't enough long-context training data to saturate the massive context window sizes).

The general graph for LLM context performance looks something like this: https://cobusgreyling.medium.com/llm-context-rot-28a6d039965... https://research.trychroma.com/context-rot

There are a bunch of tests and benchmarks (commonly referred to as "needle in a haystack") for measuring LLM performance at large context window sizes, but holding up performance across the full window is still an open area of research.

https://cloud.google.com/blog/products/ai-machine-learning/t...
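
For the curious, a needle-in-a-haystack test is conceptually very simple (minimal sketch; the filler text, needle placement, and scoring are placeholders, and `llm` is whatever callable wraps your model):

    def build_haystack(needle: str, filler_paragraph: str, n_paragraphs: int, depth: float) -> str:
        # Bury the needle at a chosen relative depth inside a wall of filler text.
        paragraphs = [filler_paragraph] * n_paragraphs
        paragraphs.insert(int(depth * n_paragraphs), needle)
        return "\n\n".join(paragraphs)

    def run_trial(llm, needle, question, expected, **kw) -> bool:
        prompt = build_haystack(needle, "The sky was a pale, uneventful grey that day.", **kw)
        prompt += "\n\n" + question
        return expected.lower() in llm(prompt).lower()

    # Sweep the context length and the needle depth, plot the pass rate,
    # and you get essentially the "context rot" heatmaps linked above.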

The thing is, generally speaking, you will get slightly better performance if you can squeeze all your code and the problem into the context window, because the LLM gets a "whole picture" view of your codebase/problem instead of a chain of broken-telephone summaries every few tens of thousands of tokens. Take this with a grain of salt, as the field is changing rapidly and it might not hold in a month or two.

Keep in mind that if the problem you are solving requires you to saturate the LLM's entire context window, a single request can cost you dollars. And if you are using a 1M+ context window model like Gemini, you can rack up costs fairly rapidly.
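
Rough arithmetic (the per-token price is a placeholder; check your provider's current pricing):

    price_per_million_input_tokens = 3.00   # placeholder USD figure, varies by model/provider
    context_tokens = 1_000_000              # a fully saturated 1M-token window

    cost_per_request = context_tokens / 1_000_000 * price_per_million_input_tokens
    print(f"${cost_per_request:.2f} per request")   # $3.00 -- before output tokens
    # An agent that resends a near-full window 20 times in a session is already
    # tens of dollars, which is why flat-fee plans cap or compact the context.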

mcv 9 hours ago | parent

Using Opus 4.5, I have noticed that in long sessions about a complex topic, there often comes a point when Opus starts spouting utter gibberish. One or two questions earlier it was making total sense, and suddenly it seems to have forgotten everything and responds in a way that barely relates to the question I asked, and certainly not to the "conversation" we were having.

Is that a sign of having surpassed the context window size? I guess to keep it sharp, I should start a new session early and often.

From what I understand, a token is either a word or a character, so I can use 100k words or characters before I start running into limits. But I've got the feeling that the complexity of the problem itself also matters.

hawtads 3 hours ago | parent

It could have exceeded either its real context window size or the artificially truncated one, and the dynamic summarization step failed to capture the bits of information you cared about. Alternatively, the information might be sitting in parts of the context window where the model does poorly at needle-in-a-haystack retrieval.

This is part of the reason why people use external data stores (e.g. vector databases, graph tools like Beads, etc.) in the hope of supplementing the agent's native context window and task management tools.

https://github.com/steveyegge/beads
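
The idea, very roughly (a toy in-memory stand-in for a real vector DB; real tools use proper embeddings, and the notes here are invented examples):

    # Toy external memory: store notes outside the context window, pull back only
    # the few most relevant ones per turn instead of resending everything.
    from collections import Counter

    notes = []   # (text, word-count vector) pairs; a real system would use embeddings

    def remember(text: str):
        notes.append((text, Counter(text.lower().split())))

    def recall(query: str, k: int = 3):
        q = Counter(query.lower().split())
        scored = sorted(notes, key=lambda n: -sum((q & n[1]).values()))
        return [text for text, _ in scored[:k]]

    remember("Auth tokens are issued by the /login endpoint and expire after 1 hour.")
    remember("The billing service retries failed charges three times.")
    print(recall("how long do auth tokens last?", k=1))
    # Only the matching note goes back into the prompt, keeping the context small.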

The whole field is still in its infancy. Who knows, maybe in another update or two the problem will just be solved. It's not like needle-in-a-haystack retrieval isn't differentiable (mathematically speaking).