bigmadshoe 3 days ago

This is called a "needle in a haystack" test, and all the 1M context models perform perfectly on this exact problem, at least when your prompt and the needle are sufficiently similar.
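For anyone who hasn't seen one, here is a minimal sketch of how such a test prompt might be put together. The function names, the filler text, and the passphrase are all made up for illustration; real benchmarks vary the context length and needle depth over a grid and score the model's answer.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` into the filler at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def build_prompt(haystack, question):
    """Append the retrieval question after the long context."""
    return f"{haystack}\n\nQuestion: {question}\nAnswer:"


# Hypothetical filler and needle, purely for illustration.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret passphrase is 'blue giraffe'."
prompt = build_prompt(
    build_haystack(filler, needle, depth=0.5),
    "What is the secret passphrase?",
)
```

Note that the question here shares wording with the needle ("secret passphrase"), which is exactly the easy case described above.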

As the piece above references, this is a totally insufficient test for the real world. Things like "find two unrelated facts tied together by a question, then perform reasoning based on them" are much harder.

Full self-attention scales as O(n^2) in context length, so a 10x longer context means roughly 100x (10^2) the attention compute. I'm not really up to date on what people are doing to combat this, but I find it hard to believe the jump from 100k -> 1M context window involved a 100x slowdown, so they're probably taking some shortcut.
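The back-of-the-envelope arithmetic, as a rough sketch: it counts only the pairwise attention score entries (one per query/key pair, per head and per layer) and ignores everything that scales linearly with length, such as the MLP and projection layers, so it is an upper bound on the relative slowdown rather than a measurement of any real model.

```python
def attention_score_entries(n_tokens):
    """Number of query/key pairs full self-attention computes for one head."""
    return n_tokens * n_tokens


short_ctx = 100_000
long_ctx = 1_000_000

ratio = attention_score_entries(long_ctx) / attention_score_entries(short_ctx)
print(ratio)  # 100.0
```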