▲ | james_marks a day ago | |||||||
> we’ve significantly reduced behavior where the models use shortcuts or loopholes to complete tasks. Both models are 65% less likely to engage in this behavior than Sonnet 3.7 on agentic tasks Sounds like it’ll be better at writing meaningful tests | ||||||||
▲ | NitpickLawyer a day ago | parent | next [-] | |||||||
One strategy that also works is to have 2 separate "sessions", have one write code and one write tests. Forbid one to change the other's "domain". Much better IME. | ||||||||
▲ | UncleEntity a day ago | parent | prev [-] | |||||||
In my experience, when presented with a failing test it would simply try to make the test pass instead of determining why the test was failing. Usually by hard coding the test parameters (or whatever) in the failing function... which was super annoying. | ||||||||
|