SWE-bench's bigger problems are that (1) labs train on the test set and (2) 50% of the tickets are from Django; it's not a representative dataset even if all you care about is Python.

I created a new benchmark from Java commits made in the past 6 months to add some variety: https://brokk.ai/power-ranking

▲

lostmsu 5 days ago | parent [-]

No GLM?

▲

jbellis 5 days ago | parent [-]

no, I'm pretty skeptical that it's better than qwen3 coder

but if you have evidence that it could be, I'm down to test it

▲

lostmsu 5 days ago | parent [-]

It has the same score on https://lmarena.ai/leaderboard/webdev, but AFAIK the Air version is much smaller.

▲

jbellis 4 days ago | parent [-]

I've added results for GLM 4.5 and 4.5 Air.