Remix clone Hacker News


	▲	trunch 11 hours ago
		Which of the LiveCodeBench Pro and SWE-Bench Verified benchmarks comes closer to everyday coding assistant tasks? I ask because it seems to lead by a decent margin on the former and trail behind on the latter.
	▲	veselin 10 hours ago
		I do a lot of testing, including on SWE-bench Verified. In my opinion, this benchmark is now good for catching regressions on the agent side. However, once scores go above 75%, the results are likely about the same. The remaining instances are likely underspecified, despite the effort of the authors who made the benchmark "verified". From what I have seen, these are often cases where the problem statement says to implement X for Y, but the agent simply has to guess whether to implement the same for another case Y', and that guess decides whether it wins or loses the instance.
	▲	Snuggly73 10 hours ago
		Neither :( LCB Pro is LeetCode-style questions, and SWE-bench Verified is heavily benchmaxxed, very old Python tasks.