kmeisthax | 6 days ago
Is there any evidence that GPT-4.1 is using RoPE to scale context? Also, I don't know about Qwen, but I know Llama 4 has severe performance issues, so I wouldn't use that as an example.
omneity | 6 days ago
I'm not sure about public evidence, but the memory requirements alone for training natively on 1M-token windows make it a very unrealistic proposition compared to RoPE scaling. And as I mentioned, RoPE is essential for long context anyway; you can't train for it the "normal" way. Please see the paper I linked previously for more context (pun not intended) on RoPE. Re: Llama 4, see the sibling comment.
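For intuition, here is a minimal sketch of one common RoPE-scaling approach (position interpolation): positions are divided by a scale factor so a long sequence is squeezed back into the position range seen during pretraining, after which a short fine-tune suffices instead of training from scratch on full-length windows. The context lengths and scale factor below are made-up illustration values, not GPT-4.1's actual recipe.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies: one rotation rate per pair of dims.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rope_angles(positions: torch.Tensor, inv_freq: torch.Tensor,
                scale: float = 1.0) -> torch.Tensor:
    # Position interpolation: dividing positions by `scale` maps a long
    # sequence back into the pretraining position range.
    return torch.outer(positions.float() / scale, inv_freq)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, head_dim); rotate each consecutive pair of features.
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

if __name__ == "__main__":
    # Hypothetical example: model pretrained at 8k, extended to 32k (4x scale).
    head_dim, train_ctx, target_ctx = 64, 8192, 32768
    inv_freq = rope_frequencies(head_dim)
    positions = torch.arange(target_ctx)
    angles = rope_angles(positions, inv_freq, scale=target_ctx / train_ctx)
    q = torch.randn(target_ctx, head_dim)
    print(apply_rope(q, angles).shape)  # torch.Size([32768, 64])
```

Other variants (NTK-aware scaling, YaRN) adjust the frequency base instead of, or in addition to, interpolating positions, but the idea is the same: extend usable context without paying for full-length pretraining.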