jbentley1 6 days ago
https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o... IMO this is the best long-context benchmark. Hopefully they will run it on the new models soon. Needle-in-a-haystack is useless at this point. Llama-4 had perfect needle-in-a-haystack results but horrible real-world performance.