> we’ve significantly reduced behavior where the models use shortcuts or loopholes to complete tasks. Both models are 65% less likely to engage in this behavior than Sonnet 3.7 on agentic tasks

Sounds like it’ll be better at writing meaningful tests

▲

UncleEntity a month ago | parent | next [-]

In my experience, when presented with a failing test it would simply try to make the test pass instead of determining why the test was failing. Usually by hard coding the test parameters (or whatever) in the failing function... which was super annoying.

	▲	0x457 a month ago \| parent [-]
		I once saw probably 10 iterations to fix a broken test, then it decided that we don't need this test at all, and it tried to just remove it. IMO, you either write tests and let it write implementation or write implementation and let it write tests. Maybe use something to write tests, then forbid "implementor" to modify them.

▲

NitpickLawyer a month ago | parent | prev [-]

One strategy that also works is to have 2 separate "sessions", have one write code and one write tests. Forbid one to change the other's "domain". Much better IME.