ants_everywhere 3 days ago

I exhaust the 1 million token context windows on multiple models multiple times per day.

I haven't used Llama 4's 10 million token context window, so I don't know how it performs in practice compared to the major non-open-source offerings with smaller context windows.

But there is an induced demand effect: as the context window grows, it opens up more possibilities, and those possibilities can then be bottlenecked on needing an even bigger context window.

For example, consider the idea of storing all Hollywood films on your computer. In the 1980s this was impossible. At DVD or Blu-ray quality you could probably do it in a few terabytes. At full quality you may be talking about petabytes.

Not long ago we struggled to get a single full file into a context window. Now a lot of people feel a bit like "just take the whole repo, it's only a few MB".

brulard 3 days ago | parent [-]

I think you misunderstand how context in current LLMs works. To get the best results you have to be careful to provide only what is needed for the immediate task, and postpone context that's needed later in the process. If you give all the context at once, you will likely get noticeably degraded output quality. It's like giving a junior developer his first task: you wouldn't teach him every corner of your app, you'd give him just the context he needs. It is similar with these models. The ones that offered 1M or 2M of context (Gemini etc.) got less and less useful after around 200k tokens of context.

Maybe models will get better at picking relevant information out of a large context, but AFAIK that is not the case today.

remexre 3 days ago | parent | next [-]

That's a rather anthropomorphized description; a more mechanical one might be:

The attention mechanism that transformers use to find information in the context is, in its simplest form, O(n^2); for each token position, the model considers whether relevant information has been produced at the position of every other token.
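
In NumPy-flavored pseudocode (a toy sketch, not any real model's implementation), the quadratic part is the (n, n) score matrix:

    # Dense self-attention: every token position scores every other position,
    # so compute and memory for the score matrix grow as O(n^2) in sequence length.
    import numpy as np

    def dense_attention(Q, K, V):
        # Q, K, V: (n_tokens, d) arrays of query/key/value vectors
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                     # (n, n) pairwise scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
        return weights @ V                                # weighted mix of all values

    rng = np.random.default_rng(0)
    n, d = 8, 16
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    out = dense_attention(Q, K, V)                        # (n, d)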

To preserve performance when really long contexts are used, current-generation LLMs use various ways to consider fewer positions in the context; for example, they might only consider the 4096 "most likely" places to matter (de-emphasizing large numbers of "subtle hints" that something isn't correct), or they might have some way of combining multiple tokens worth of information into a single value (losing some fine detail).
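
A toy version of the "only keep the k most likely positions" idea might look like the following (real long-context kernels avoid materializing the full score matrix at all, so this only illustrates the effect, not any production implementation; k=4096 just mirrors the number above):

    # Top-k attention sketch: mask out all but the k highest-scoring positions
    # per query before the softmax, so each token only "reads" k places.
    import numpy as np

    def topk_attention(Q, K, V, k=4096):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        k = min(k, scores.shape[-1])
        cutoff = np.sort(scores, axis=-1)[:, -k][:, None]     # k-th largest score per query
        scores = np.where(scores >= cutoff, scores, -np.inf)  # drop everything below it
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V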

ants_everywhere 3 days ago | parent | prev | next [-]

> I think you misunderstand how context in current LLMs works.

Thanks but I don't and I'm not sure why you're jumping to this conclusion.

EDIT: Oh, I think you're talking about the last bit of the comment! If you read the paragraph before it, I say that feeding it the entire repo isn't a great idea. But great idea or not, people want to do it, and it illustrates that as the context window increases it creates demand for even larger context windows.

brulard 2 days ago | parent [-]

I said that based on you saying you exhaust million-token context windows multiple times a day. I'm no expert on that, but I think the current crop of LLMs works best if you are not approaching that 1M token limit, because a large context (reportedly) degrades response quality quickly. I think state-of-the-art usage is managing context in the tens or low hundreds of thousands of tokens at most, splitting tasks into subtasks over time, or splitting context across multiple "expert" agents (see sub-agents in Claude Code).
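
A rough sketch of that pattern, with call_llm standing in for whatever model API you use (this is not Claude Code's actual interface, just the shape of the idea):

    def call_llm(prompt: str) -> str:
        return ""  # stub: replace with a real model/API call

    def run_with_subagents(task: str, files: dict[str, str]) -> str:
        # The planner sees only the task and the file names, not every file's contents.
        plan = call_llm(f"Task: {task}\nFiles: {sorted(files)}\n"
                        "List one sub-task per line, ending with the file it needs.")
        partial_results = []
        for line in plan.splitlines():
            filename = line.rsplit(" ", 1)[-1]
            focused_context = files.get(filename, "")  # each sub-agent gets one small slice
            partial_results.append(call_llm(f"{line}\n\n{focused_context}"))
        # A final pass merges the focused answers back together.
        return call_llm(f"Task: {task}\nCombine these partial results:\n"
                        + "\n".join(partial_results))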

jimbokun 3 days ago | parent | prev [-]

It seems like LLMs need to become experts at managing their OWN context.

Selectively grepping and searching the code to pull into context only those parts relevant to the task at hand.
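
Something like this hypothetical harness (the rg call is ordinary ripgrep usage, but the symbol name and prompt format are made up for illustration):

    import subprocess

    def grep_repo(pattern, repo_dir, context_lines=3):
        # ripgrep with a few lines of surrounding context; returns matching snippets
        result = subprocess.run(
            ["rg", "-n", "-C", str(context_lines), pattern, repo_dir],
            capture_output=True, text=True,
        )
        return result.stdout

    def build_prompt(task, repo_dir):
        # "parse_config" stands in for whatever symbol the model decided is relevant
        snippets = grep_repo("parse_config", repo_dir)
        return f"Task: {task}\n\nRelevant code:\n{snippets}"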

brulard 2 days ago | parent [-]

That's something I'm thinking about a lot. Something like how the new models "activate" just a subset of their parameters when working (if I understand them correctly): the model could similarly activate only the parts of the context that are relevant to the task at hand.