Remix clone Hacker News

	▲	jbentley1 6 days ago
		https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o... IMO this is the best long-context benchmark. Hopefully they will run it for the new models soon. Needle-in-a-haystack is useless at this point: Llama-4 had perfect needle-in-a-haystack results but horrible real-world performance.