kmeisthax | 6 days ago
The problem is that while you can train a model with the "context size" hyperparameter set to 1M, there's very little 1M-token data to train on. Most of a model's ability to follow long context comes from the fact that it's trained on lots of (stolen) books; in fact I believe OpenAI outright said in court that they can't do long context without training on books. Novels are usually measured in words, and a common rule of thumb is that four tokens make up about three words. So that 200k-token wall you're hitting is right about where most authors stop writing: 150k words is already considered long for a novel, and to train 1M properly you'd need not just one 750k-word book, but many of them. Humans just don't write or read that much text at once.

To get around this, whoever is training these models would need to change their training strategy to one of the following:

- Group books in a series together as a single, very long text to be trained on
- Train on multiple unrelated books at once in the same context window (sketched below)
- Amplify the gradients by the length of the text being trained on, so that the few long texts that do exist have greater influence on the model weights as a whole

I suspect they're doing #2, just to get some gradients onto the longer end of the context window, but that also diminishes long-context reasoning, because there's no reason for the model to develop a connection between, say, token 32 and token 985,234.
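To make strategy #2 concrete, here is a minimal sketch of packing several unrelated, already-tokenized documents into one long training sequence. The names (CONTEXT_LEN, EOS_ID, pack_documents) and the greedy policy are my own illustration, not any lab's disclosed pipeline.

```python
# Hypothetical sketch of strategy #2: pack unrelated tokenized documents
# into one long training sequence. All names and numbers are assumptions.
from typing import Iterable, Iterator, List

CONTEXT_LEN = 1_000_000  # target context window, in tokens
EOS_ID = 0               # separator token between unrelated documents

def pack_documents(docs: Iterable[List[int]]) -> Iterator[List[int]]:
    """Greedily concatenate tokenized documents until the window is full."""
    window: List[int] = []
    for tokens in docs:
        if window and len(window) + len(tokens) + 1 > CONTEXT_LEN:
            yield window
            window = []
        window.extend(tokens)
        window.append(EOS_ID)  # mark the document boundary
    if window:
        yield window
```

Packed batches like this are often paired with block-diagonal attention masks so tokens from different documents can't attend to each other, which is exactly why packing gives so little signal for genuine million-token dependencies.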
omneity | 6 days ago
I'm not sure to what extent this opinion is accurately informed. It is well known that nobody trains on 1M-token-long content. It wouldn't work anyway: the dependencies are too far apart and you end up with vanishing gradients. RoPE (Rotary Positional Embeddings; think modulo or periodic arithmetic) scaling is key, whereby the model is trained on 16k-token-long content and then scaled up to 100k+ [0]. Qwen 1M (which has near-perfect recall over the complete window [1]) and Llama 4 10M pushed the limits of this technique, with Qwen training reliably at a much higher RoPE base, and Llama 4 coming up with iRoPE, which claims scaling to extremely long contexts, up to infinity.

[0]: https://arxiv.org/html/2310.05209v2

[1]: https://qwenlm.github.io/blog/qwen2.5-turbo/#passkey-retriev...
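For concreteness, here is a minimal NumPy sketch of the frequency schedule that RoPE base scaling changes, in the spirit of [0]. The function name and the particular numbers are illustrative assumptions, not Qwen's or Llama's actual code.

```python
# Minimal sketch of RoPE angles as a function of the base. Illustrative only.
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int, base: float) -> np.ndarray:
    """Rotation angle for each (position, channel pair); RoPE rotates each
    query/key channel pair by the corresponding angle."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)  # (head_dim/2,)
    return np.outer(positions, inv_freq)                        # (n_positions, head_dim/2)

positions = np.arange(0, 1_000_000, 100_000)
short = rope_angles(positions, head_dim=128, base=10_000.0)     # typical pre-training base
long_ = rope_angles(positions, head_dim=128, base=1_000_000.0)  # larger base stretches the
                                                                # wavelengths, so far-apart
                                                                # positions stay distinguishable
```

Raising the base lengthens the rotation wavelengths, so positions far beyond the original 16k training range still map to distinct angles; that is roughly what "training with a much higher RoPE base" buys you.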
killerstorm | 5 days ago
No, there's a fundamental limitation of the Transformer architecture: training data isn't the problem. In principle, as you scale a transformer you get more heads and more dimensions in each vector, so the bandwidth of the attention data bus goes up, and thus the precision of recall goes up too.
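One way to read that claim as rough arithmetic (my own back-of-the-envelope framing, not something from the thread's sources): each attention head returns head_dim numbers per token per layer, so the width of the recall channel grows with layers × heads × head_dim.

```python
# Back-of-the-envelope "attention bandwidth" comparison. The two configs are
# illustrative round numbers, not any particular model's published sizes.
def attention_bandwidth(layers: int, heads: int, head_dim: int) -> int:
    return layers * heads * head_dim

small = attention_bandwidth(layers=32, heads=32, head_dim=128)  # 131,072 values per token
large = attention_bandwidth(layers=80, heads=64, head_dim=128)  # 655,360 values per token
print(small, large)
```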
wskish | 6 days ago
Codebases of high-quality open source projects and their major dependencies are probably another good source. Also: "transformative fair use", not "stolen".
crimsoneer | 6 days ago
Isn't the problem more that the "needle in a haystack" eval (I said word X once; where?) really isn't relevant to most long-context LLM use cases like code, where you need the context from all the stuff simultaneously rather than identifying a single, quite separate relevant section?
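For reference, a needle-in-a-haystack / passkey eval is typically constructed roughly like the sketch below (illustrative only; not the exact harness behind the passkey results linked in [1] above).

```python
# Rough sketch of a passkey-retrieval prompt: bury one random number inside
# repetitive filler and ask for it back. Illustrative construction.
import random

def build_passkey_prompt(n_filler_sentences: int = 50_000) -> tuple[str, str]:
    passkey = str(random.randint(10_000, 99_999))
    filler = "The grass is green. The sky is blue. " * n_filler_sentences
    needle = f" The passkey is {passkey}. Remember it. "
    insert_at = random.randint(0, len(filler))
    haystack = filler[:insert_at] + needle + filler[insert_at:]
    return haystack + "\nWhat is the passkey?", passkey
```

Retrieving one isolated string is a much weaker requirement than keeping, say, an entire codebase's call graph coherent at once, which is the gap being pointed out here.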
roflmaostc | 6 days ago
What about old books? Wikipedia? Law texts? Programming language documentation? How many tokens is a 100-page PDF? 10k to 100k?
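Roughly, under the four-tokens-per-three-words rule of thumb mentioned upthread, and assuming about 500 words per dense page (both assumptions; real documents vary a lot), a 100-page PDF lands well inside that range:

```python
# Rough token estimate for a 100-page text-heavy PDF. Rules of thumb only.
pages = 100
words_per_page = 500
words = pages * words_per_page  # 50,000 words
tokens = words * 4 // 3         # ~66,000 tokens
print(tokens)
```

So the 10k-100k guess is about right, toward the upper end for text-heavy pages.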
nneonneo | 6 days ago
I mean, can’t they just train on some huge codebases? There are lots of 100 KLOC codebases out there which would probably get close to 1M tokens.
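A quick sanity check on that intuition, assuming roughly 8 to 12 tokens per line of code (my assumption; it varies by language, formatting style, and tokenizer):

```python
# Rough token estimate for a 100 KLOC codebase under assumed tokens-per-line.
lines_of_code = 100_000
for tokens_per_line in (8, 10, 12):
    print(tokens_per_line, lines_of_code * tokens_per_line)  # 800k / 1.0M / 1.2M
```

So 100 KLOC does land in the neighborhood of 1M tokens under those assumptions.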