Remix clone Hacker News

new | show | ask | jobs Github

	▲	uriegas 4 hours ago
		I do agree with his identification of the problem: sometimes agents fail because of the tools around it and not because of the model's reasoning. However, for the failing tests I think he is not making the distinction between a failed test due to a harness failure or due to a reasoning failure. It would be nice if someone analyzed that from the data set.