bigmadshoe 3 days ago
This is called a "needle in a haystack" test, and all the 1M-context models perform perfectly on this exact problem, at least when the prompt and the needle are sufficiently similar. As the piece above notes, this is a totally insufficient test for the real world. Tasks like "find two unrelated facts tied together by a question, then reason over both" are much harder.

Scaling context properly is O(n^2), since vanilla self-attention compares every token against every other token. I'm not really up to date on what people are doing to combat this, but I find it hard to believe the jump from 100k to 1M context involved a 100x (10^2) slowdown, so they're probably taking some shortcut.
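To make the O(n^2) point concrete, here's a minimal sketch assuming plain softmax attention (the 1M-context models almost certainly don't compute exactly this, which is the point): the score matrix alone is n x n, so 10x the context naively means 100x the work per head per layer.

    import numpy as np

    def naive_attention(q, k, v):
        # q, k, v: (n, d) arrays. The score matrix below is (n, n),
        # which is the quadratic term in both compute and memory.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    # Small demo that actually runs:
    rng = np.random.default_rng(0)
    q = k = v = rng.standard_normal((8, 4))
    print(naive_attention(q, k, v).shape)  # (8, 4)

    # The back-of-envelope arithmetic for long context:
    for n in (100_000, 1_000_000):
        print(f"{n:>9} tokens -> {n * n:.1e} attention scores")

At n = 1M you'd need a 10^12-entry score matrix per head per layer, which is why nobody materializes it; sparse/linear attention variants, sliding windows, and tricks like FlashAttention-style tiling all exist to avoid paying that cost directly.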