jbellis 5 days ago
SWE-bench's bigger problems include (1) labs train on the test set and (2) 50% of the tickets are from Django; it's not a representative dataset even if all you care about is Python. I created a new benchmark from Java commits made in the past 6 months to add some variety: https://brokk.ai/power-ranking
lostmsu 5 days ago
No GLM?