Why most AI coding benchmarks are misleading (COMPASS paper)

Hi, I’m one of the authors. Happy to answer questions about the dataset (LLM coding performance compared to 390k+ human submissions), the scoring approach, or the methodology behind COMPASS. Feedback and critique are welcome.

	sieep 20 hours ago
		Hello, I'm someone who does not have a background in CS, so my apologies for not being able to read the paper in full. Is there a clear-cut strategy you would recommend to model developers so they can improve not just in correctness but also in quality and efficiency? I'm sure it's in the paper, and I wish I could understand it in depth. If you don't mind me asking a more personal question: I would love to go back to uni for a master's in computer science and hopefully assist with papers like this one day. Do you have any advice for someone with industry (SWE) rather than academic experience on making the leap to the academic side? I genuinely love this kind of stuff and already make a decent living, so it's not for the money.