Remix clone Hacker News

	▲	enginoid 6 days ago
		There are some benchmarks such as Fiction.LiveBench[0] that give an indication, and the new Graphwalks approach looks super interesting. But I'd love to see one specifically for "meaningful coding." Coding has specific properties that matter, such as variable tracking (following coreference chains), described in RULER[1]. That paper also cautions against Single-Needle-In-The-Haystack tests, which I think the OpenAI one might be. You really need at least Multi-NIAH for it to tell you anything meaningful, which is what they've done for the Gemini models. I think something a bit more interpretable like `pass@1 rate for coding turns at 128k` would be so much more useful than "we have 1m context" (with the acknowledgement that good-enough performance is often domain-dependent).
		[0] https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...
		[1] https://arxiv.org/pdf/2404.06654
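		To make the `pass@1 rate for coding turns at 128k` idea concrete, here is a rough sketch of how such a number could be tabulated. Everything in it is an assumption for illustration: the function name `pass_at_1_by_context`, the bucket boundaries, and the sample data are made up, and it presumes you already have one (context length, passed) record per coding turn from a single attempt each. It is not how any existing benchmark computes its scores.

```python
from collections import defaultdict

def pass_at_1_by_context(results, buckets=(8_000, 32_000, 128_000, 1_000_000)):
    """Group single-attempt coding turns by context size and report pass@1 per group.

    results: iterable of (context_tokens, passed) pairs, one per coding turn.
    buckets: ascending upper bounds in tokens (hypothetical boundaries).
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [passed, attempted]
    for context_tokens, passed in results:
        # Assign the turn to the smallest bucket that can hold its context.
        bucket = next((b for b in buckets if context_tokens <= b), None)
        if bucket is None:
            continue  # longer than the largest bucket, so it is skipped
        totals[bucket][0] += int(passed)
        totals[bucket][1] += 1
    # With one attempt per turn, pass@1 is just the fraction of turns that passed.
    return {b: p / n for b, (p, n) in sorted(totals.items())}

# Made-up records, purely to show the shape of the output.
sample = [(5_000, True), (6_000, True), (30_000, True),
          (100_000, True), (120_000, False)]
print(pass_at_1_by_context(sample))  # {8000: 1.0, 32000: 1.0, 128000: 0.5}
```

		Reporting the whole curve rather than a single context-window figure is the point: a drop between the 32k and 128k buckets says more than "we have 1m context" does.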