"A benchmark for catching when code doesn't do what its documentation claims"

	▲	"A benchmark for catching when code doesn't do what its documentation claims"(github.com)
		3 points by o2zer0cool 11 hours ago | 2 comments

	▲	westurner 9 hours ago
		Suggestions: would it be more maintainable to rewrite this with pytest-evals? That is, write pytest tests with pytest.mark.parametrize, fixtures, and mocks, and push to >90% branch coverage with pytest-cov. I don't think any of these benchmarks do model output evals for docs yet? See Mcpbr > Supported Benchmarks: https://github.com/supermodeltools/mcpbr#supported-benchmark... On subjectivity and language, there was also this the other day: https://github.com/mozilla/firefox-devtools-mcp/pull/90#issu... : > how to optimize an AGENTS.md: > [agentevals, foundry-toolkit, ]
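		To make the pytest suggestion concrete, here is a minimal sketch of what parametrized doc-vs-behavior checks could look like. Everything in it is hypothetical and not taken from the linked repo: `check_claim`, `add`, `buggy_abs`, and the `fake_model` fixture with its `judge` method are made-up stand-ins, and it uses plain pytest rather than pytest-evals.

```python
from unittest import mock

import pytest


def check_claim(func, args, documented_result):
    """Hypothetical helper: does func(*args) return what its docs claim?"""
    return func(*args) == documented_result


def add(a, b):
    """Documented as: returns the sum of a and b."""
    return a + b


def buggy_abs(x):
    """Documented as: returns the absolute value of x (it does not)."""
    return x


@pytest.fixture
def fake_model():
    # Assumption: a mock stands in for a model client so tests run offline.
    model = mock.Mock()
    model.judge.return_value = "mismatch"
    return model


@pytest.mark.parametrize(
    "func, args, documented, expected",
    [
        (add, (2, 3), 5, True),        # code matches its documentation
        (buggy_abs, (-4,), 4, False),  # code contradicts its documentation
    ],
)
def test_check_claim(func, args, documented, expected):
    assert check_claim(func, args, documented) is expected


def test_model_verdict_is_recorded(fake_model):
    # The mocked judge returns a canned verdict; we only check the wiring.
    verdict = fake_model.judge(code="def f(): ...", doc="Returns 1.")
    assert verdict == "mismatch"
    fake_model.judge.assert_called_once()
```

		With pytest-cov installed, something like `pytest --cov=. --cov-branch --cov-fail-under=90` would enforce the branch coverage threshold mentioned above.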
	▲	o2zer0cool 11 hours ago
		[flagged]