Remix.run Logo
NitpickLawyer 2 hours ago

This might actually be the whole value prop of this benchmark. Forget their initial scores, take open models (so we can be sure the base doesn't change), and test different combinations of harness + prompts + strategies + whatever memthing is popular today. See if the scores improve. Repeat.