schipperai 2 hours ago
Cognition did a good job documenting their approach [1]. TL;DR: they worked with OSS project maintainers to build the tasks, they score models on whether a PR is mergeable, and every task is graded by a human researcher. SoTA models still have hill-climbing to do, which raises the bar and inspires confidence. I'd say it's legit.