kmeisthax 6 days ago

The problem is that while you can train a model with the hyperparameter of "context size" set to 1M, there's very little 1M-token data to train on. Most of your model's ability to follow long context comes from the fact that it's trained on lots of (stolen) books; in fact, I believe OpenAI just outright said in court that they can't do long context without training on books.

Novels are usually measured in words, and a common rule of thumb is that four tokens make up about three words. So that 200k-token wall you're hitting is right where most authors stop writing: 150k words is already considered long for a novel, and to train 1M properly you'd need not just one 750k-word book, but many of them. Humans just don't write or read that much text at once.
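
To make the arithmetic concrete, here's the rule of thumb worked out (a rough sketch only; real token counts vary by tokenizer and text):

  # Back-of-the-envelope math for the 4-tokens-per-3-words rule of thumb.
  # Illustrative only; actual counts depend on the tokenizer and the text.
  TOKENS_PER_WORD = 4 / 3

  for words in (100_000, 150_000, 750_000):
      tokens = round(words * TOKENS_PER_WORD)
      print(f"{words:>9,} words ≈ {tokens:>11,} tokens")

  # 100,000 words ≈   133,333 tokens  (a typical novel)
  # 150,000 words ≈   200,000 tokens  (a long novel -- the 200k wall)
  # 750,000 words ≈ 1,000,000 tokens  (what one 1M-token training sample needs)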

To get around this, whoever is training these models would need to change their training strategy to either:

- Group books in a series together as a single, very long text to be trained on

- Train on multiple unrelated books at once in the same context window

- Amplify the gradients by the length of the text being trained on so that the fewer long texts that do exist have greater influence on the model weights as a whole.

I suspect they're doing #2, just to get some gradients onto the longer end of the context window, but that is also going to diminish long-context reasoning, because there's no reason for the model to develop a connection between, say, token 32 and token 985,234.
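
As a rough illustration of what #2 looks like in a data pipeline (a sketch only; pack_documents and its parameters are hypothetical, not anything any lab has published):

  # Sketch of strategy #2: pack unrelated documents into one long training sample.
  # Hypothetical helper, not a real training pipeline.
  import random

  def pack_documents(token_ids_per_doc, context_len=1_000_000, sep_id=0):
      """Greedily concatenate unrelated documents until the window is full."""
      random.shuffle(token_ids_per_doc)
      samples, packed = [], []
      for doc in token_ids_per_doc:
          if packed and len(packed) + len(doc) + 1 > context_len:
              samples.append(packed)
              packed = []
          packed.extend(doc + [sep_id])  # separator token between unrelated books
      if packed:
          samples.append(packed)
      return samples

  # Tokens from different books end up far apart in the same window, but nothing
  # in the data ever rewards attending from, say, position 985_234 back to 32.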

omneity 6 days ago | parent | next [-]

I'm not sure to what extent this opinion is accurately informed. It is well known that nobody trains directly on 1M-token-long content. It wouldn't work anyway: the dependencies span too far and you end up with vanishing gradients.

RoPE (Rotary Positional Embeddings; think modular or periodic arithmetic) scaling is key: the model is trained on 16k-token-long content and then scaled up to 100k+ [0]. Qwen 1M (which has near-perfect recall over the complete window [1]) and Llama 4 10M pushed the limits of this technique, with Qwen training reliably with a much higher RoPE base, and Llama 4 introducing iRoPE, which claims to scale to extremely long contexts, up to infinity.

[0]: https://arxiv.org/html/2310.05209v2

[1]: https://qwenlm.github.io/blog/qwen2.5-turbo/#passkey-retriev...
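
To make the mechanism concrete, here's roughly what RoPE and the base-scaling trick look like (a simplified NumPy sketch of the general idea, not Qwen's or Llama's actual implementation; the numbers are illustrative):

  # Rotary positional embeddings, plus context extension by raising the rotary base.
  # Simplified: real implementations work per attention head, in the model's dtype.
  import numpy as np

  def rope_angles(positions, dim, base=10_000.0):
      # One frequency per pair of dimensions; lower frequencies for later pairs.
      inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # shape (dim/2,)
      return np.outer(positions, inv_freq)               # shape (n_positions, dim/2)

  def apply_rope(x, base=10_000.0):
      # Rotate each (even, odd) pair of query/key dims by a position-dependent angle.
      seq_len, dim = x.shape
      ang = rope_angles(np.arange(seq_len), dim, base)
      cos, sin = np.cos(ang), np.sin(ang)
      x1, x2 = x[:, 0::2], x[:, 1::2]
      out = np.empty_like(x)
      out[:, 0::2] = x1 * cos - x2 * sin
      out[:, 1::2] = x1 * sin + x2 * cos
      return out

  q = np.random.randn(16, 64)
  short_ctx = apply_rope(q, base=10_000.0)     # training-time setting (e.g. 16k windows)
  long_ctx  = apply_rope(q, base=1_000_000.0)  # raised base: distant positions rotate more
                                               # slowly, staying in angle ranges seen in training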

christianqchung 6 days ago | parent | next [-]

But Llama 4 Scout does badly on long-context benchmarks despite claiming 10M. It scores one slot above Llama 3.1 8B in this one [1].

[1] https://github.com/adobe-research/NoLiMa

omneity 6 days ago | parent [-]

Indeed, but that doesn't take away from the fact that long context is not trained on long content; it's achieved by scaling short content instead.

kmeisthax 6 days ago | parent | prev [-]

Is there any evidence that GPT-4.1 is using RoPE to scale context?

Also, I don't know about Qwen, but I know Llama 4 has severe performance issues, so I wouldn't use that as an example.

omneity 6 days ago | parent [-]

I am not sure about public evidence. But the memory requirements alone to train on 1M-token-long windows would make it a very unrealistic proposition compared to RoPE scaling. And as I mentioned, RoPE is essential for long context anyway; you can't train it the "normal" way. Please see the paper I linked previously for more context (pun not intended) on RoPE.

Re: Llama 4, please see the sibling comment.

killerstorm 5 days ago | parent | prev | next [-]

No, there's a fundamental limitation of the Transformer architecture:

  * information from the entire context has to be squeezed into an information channel of a fixed size; the more information you try to squeeze through, the more noise you get
  * selection of what information passes through is done using just a dot product
Training data isn't the problem.

In principle, as you scale the transformer you get more heads and more dimensions in each vector, so the bandwidth of the attention data bus goes up and thus the precision of recall goes up too.
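
A stripped-down view of that bottleneck (a NumPy sketch of vanilla single-head scaled dot-product attention, not any particular model):

  # One attention head: the whole context is mixed down into a single d-dimensional
  # vector, and the mixing weights come from nothing more than dot product + softmax.
  import numpy as np

  def single_head_attention(q, K, V):
      # q: (d,), K and V: (seq_len, d)  ->  one output vector of size d
      scores = K @ q / np.sqrt(q.shape[0])   # dot-product "selection"
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()               # softmax over the entire context
      return weights @ V                     # everything squeezed into d numbers

  d, seq_len = 64, 100_000
  q = np.random.randn(d)
  K = np.random.randn(seq_len, d)
  V = np.random.randn(seq_len, d)
  out = single_head_attention(q, K, V)       # 100k tokens of context -> 64 floats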

wskish 6 days ago | parent | prev | next [-]

Codebases of high-quality open source projects and their major dependencies are probably another good source. Also: "transformative fair use", not "stolen".

crimsoneer 6 days ago | parent | prev | next [-]

Isn't the problem more that the "needle in a haystack" eval (I said word X once; where?) is really not relevant to most long-context LLM use cases like code, where you need the context from all of the material simultaneously rather than identifying a single, quite separate relevant section?

omneity 6 days ago | parent [-]

What you're describing as "needle in a haystack" is a necessary requirement for the downstream ability you want. The distinction is really how many "things" the LLM can process in a single shot.

LLMs process tokens sequentially, first in a prefilling stage, where the model reads your input, then in a generation stage, where it outputs response tokens. The attention mechanism is what allows the LLM, as it is ingesting or producing tokens, to "notice" that a token it has seen previously (your instruction) is related to a token it is now seeing (the code).

Of course this mechanism has limits (correlated with model size), and if the LLM needs to take the whole input into consideration to answer the question, the results won't be great.
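
A minimal sketch of those two stages (single head, no batching; the shapes and cache handling are simplified assumptions, not any specific implementation):

  # Prefill: read the prompt once and cache its keys/values.
  # Generation: each new token's query is dotted against every cached key, which is
  # how an instruction early in the prompt can influence the token being written now.
  import numpy as np

  d = 64
  rng = np.random.default_rng(0)

  def attend(q, K, V):
      s = K @ q / np.sqrt(d)
      w = np.exp(s - s.max())
      w /= w.sum()
      return w @ V

  prompt_len = 100_000                              # a long prompt
  K_cache = rng.standard_normal((prompt_len, d))    # stand-ins for the prompt's keys
  V_cache = rng.standard_normal((prompt_len, d))    # and values after prefill

  for _ in range(5):                                # generation loop
      q_new = rng.standard_normal(d)                # query for the token being generated
      ctx = attend(q_new, K_cache, V_cache)         # attends over the whole cached prompt
      k_new, v_new = rng.standard_normal(d), rng.standard_normal(d)
      K_cache = np.vstack([K_cache, k_new])         # the new token joins the cache
      V_cache = np.vstack([V_cache, v_new])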

roflmaostc 6 days ago | parent | prev | next [-]

What about old books? Wikipedia? Law texts? Programming language documentation?

How many tokens is a 100-page PDF? 10k to 100k?

arvindh-manian 6 days ago | parent | next [-]

For reference, I think a common approximation is that one token is about 0.75 words.

For a 100-page book, that translates to around 50,000 tokens. For 1M+ tokens, we need to be looking at 2,000+ page books. That's pretty rare, even for documentation.

It doesn't have to be text-based, though. I could see films and TV shows becoming increasingly important for long-context model training.

handfuloflight 6 days ago | parent [-]

What about the role of synthetic data?

throwup238 6 days ago | parent [-]

Synthetic data requires a discriminator that can select the highest-quality results to feed back into training. Training a discriminator is easier than training a full-blown LLM, but it still suffers from a lack of high-quality training data in the case of 1M context windows. How do you train a discriminator to select good 2,000-page synthetic books if the only ones you have to train it with are Proust and concatenated Harry Potter/Game of Thrones/etc.?
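
A sketch of the loop being described (generate_long_doc and quality_score are hypothetical stand-ins for a generator and a trained discriminator, not real APIs):

  # Keep only the synthetic long documents that a learned scorer rates highly,
  # then feed those back into training.
  def filter_synthetic_corpus(generate_long_doc, quality_score,
                              n_candidates=10_000, threshold=0.8):
      kept = []
      for _ in range(n_candidates):
          doc = generate_long_doc(target_tokens=1_000_000)
          if quality_score(doc) >= threshold:   # the discriminator decides what survives
              kept.append(doc)
      return kept

  # The catch raised above: quality_score itself has to learn what a good
  # 1M-token document looks like, and there are very few examples to learn from.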

jjmarr 6 days ago | parent | prev [-]

Wikipedia does not have many pages that are 750k words. According to Special:LongPages [1], the longest page right now is a little under 750k bytes.

https://en.wikipedia.org/wiki/List_of_chiropterans

Despite listing all presently known bats, the majority of the byte count of "List of chiropterans" is code that generates references to the IUCN Red List, not actual text. Most of Wikipedia's longest articles are code.

[1] https://en.wikipedia.org/wiki/Special:LongPages

nneonneo 6 days ago | parent | prev [-]

I mean, can’t they just train on some huge codebases? There are lots of 100KLOC codebases out there that would probably get close to 1M tokens.