miguel_martin:
It’s unfortunate that they didn’t evaluate using subagents/orchestration for such a complex set of tasks (from what I can tell), e.g. analyze the program to produce an initial spec -> code -> review, and rinse and repeat, with each of those steps allocated to a separate subagent. I’d be interested to see whether there’s a significant, quantifiable difference. A minimal sketch of that kind of loop is below.
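Roughly the loop I mean, assuming a hypothetical run_subagent() call standing in for whatever model invocation the harness actually makes; every name and the "LGTM" stop condition here are made up for illustration, not anything from the benchmark:

    import textwrap

    def run_subagent(role: str, task: str) -> str:
        """Hypothetical stand-in for a model call; each role gets a fresh context."""
        # A real harness would spawn an isolated agent (own prompt, own context)
        # and return its text output.
        return f"[{role} output for: {textwrap.shorten(task, 60)}]"

    def orchestrate(program: str, max_rounds: int = 3) -> str:
        # Step 1: a dedicated analyzer subagent produces the initial spec.
        spec = run_subagent("analyzer", f"Produce a spec for:\n{program}")
        code = ""
        for _ in range(max_rounds):
            # Step 2: a coder subagent implements against the spec alone.
            code = run_subagent("coder", f"Implement this spec:\n{spec}")
            # Step 3: a reviewer subagent critiques the code; its feedback
            # is folded back into the spec and the loop repeats.
            review = run_subagent("reviewer", f"Review against spec:\n{spec}\n{code}")
            if "LGTM" in review:  # hypothetical stop condition
                break
            spec = f"{spec}\nReviewer feedback:\n{review}"
        return code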
NitpickLawyer (reply):
This might actually be the whole value prop of this benchmark. Forget their initial scores; take open models (so we can be sure the base doesn't change) and test different combinations of harness + prompts + strategies + whatever memory thing is popular today. See if the scores improve. Repeat.
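Something like this sweep, where the base model is pinned and only the scaffolding varies; run_benchmark(), the model name, and all the axis values are hypothetical placeholders:

    from itertools import product

    # Hypothetical axes; the open-weights model is fixed across every run,
    # so any score difference is attributable to the scaffolding.
    MODEL = "some-open-coder-model"
    HARNESSES = ["plain", "subagents"]
    PROMPTS = ["terse", "spec-first"]
    STRATEGIES = ["single-pass", "review-loop"]

    def run_benchmark(model, harness, prompt, strategy) -> float:
        """Hypothetical stub; a real runner would execute the suite and return a score."""
        return 0.0

    results = {}
    for harness, prompt, strategy in product(HARNESSES, PROMPTS, STRATEGIES):
        results[(harness, prompt, strategy)] = run_benchmark(
            MODEL, harness, prompt, strategy
        )

    # Rank configurations; improvements isolate the harness, not the model.
    for cfg, score in sorted(results.items(), key=lambda kv: -kv[1]):
        print(cfg, score)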