| ▲ | DRMacIver 5 hours ago | |||||||
> But the problem remains verifying that the tests actually test what they're supposed to.

Definitely. It's a lot harder to fake this with PBT than with example-based testing, but you can still write bad property-based tests and agents are pretty good at doing so. I have generally found that agents with property-based tests are much better at not lying to themselves about it than agents with just example-based testing, but I still spend a lot of time yelling at Claude.

> So "a huge part" - possibly, but there are other huge parts still missing.

No argument here. We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world. | ||||||||
| ▲ | sunshowers 30 minutes ago | parent | next [-] | |||||||
A fun recent experience I had with Claude: I asked it to write a model for PBTs against a complex SUT, and it duplicated the SUT algorithm in the model, which was not helpful! I had to explicitly prompt it to write the model algorithm in a completely different style. | ||||||||
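To make the "model in a completely different style" point concrete, here is a minimal sketch, not anything from the comment itself. The `RingBuffer` SUT, the `check_against_model` helper, and its parameters are all hypothetical names invented for illustration. It uses only the standard-library `random` module rather than a real PBT library such as Hypothesis, so there is no shrinking. The point is that the model is a plain list with none of the SUT's index arithmetic, so a bug in that arithmetic cannot be mirrored in the model.

```python
import random


class RingBuffer:
    """Hypothetical SUT: fixed-capacity FIFO built on index arithmetic."""

    def __init__(self, capacity):
        self._buf = [None] * capacity
        self._head = 0
        self._size = 0

    def push(self, x):
        # Reject the push when the buffer is full.
        if self._size == len(self._buf):
            return False
        self._buf[(self._head + self._size) % len(self._buf)] = x
        self._size += 1
        return True

    def pop(self):
        # Return None when the buffer is empty.
        if self._size == 0:
            return None
        x = self._buf[self._head]
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        return x


def check_against_model(sut_factory=RingBuffer, capacity=4, steps=500, seed=0):
    """Run random push/pop operations and compare the SUT to a list model."""
    rng = random.Random(seed)
    sut = sut_factory(capacity)
    model = []  # deliberately simple: append at the end, pop from the front
    for _ in range(steps):
        if rng.random() < 0.6:
            x = rng.randint(0, 99)
            expected = len(model) < capacity
            if expected:
                model.append(x)
            if sut.push(x) != expected:
                return False
        else:
            expected = model.pop(0) if model else None
            if sut.pop() != expected:
                return False
    return True
```

If the model had instead been a second ring buffer with the same modular indexing, the two would agree even when both are wrong, which is the failure described above.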
| ||||||||
| ▲ | pron 5 hours ago | parent | prev | next [-] | |||||||
> We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world.

Yeah, I know. Just an opportunity to talk about some of the delusions we're hearing from the "CEO class". Keep up the good work! | ||||||||
| ▲ | ngruhn 5 hours ago | parent | prev [-] | |||||||
> I have generally found that agents with property-based tests are much better at not lying to themselves

I have also seen the cheating increase. I recently tried to do a specific optimization on a big, complex function. I wrote a PBT that checks that the original function returns the same values as the optimized function on all inputs. I also tracked the runtime to confirm that performance improved. Then I let Claude loose. The PBT was great at spotting edge cases, but eventually Claude always started cheating: it modified the test, it modified the original function, it implemented other (easier) optimizations, ... | ||||||||
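For readers who have not seen this kind of test, the "original returns the same values as optimized" property can be sketched roughly as follows. This is an assumption-laden illustration, not the commenter's code: `original`, `optimized`, and `find_counterexample` are hypothetical stand-ins, and the generator is a hand-rolled loop over the standard-library `random` module rather than a real PBT library, so it does not shrink failing inputs.

```python
import random


def original(xs):
    """Hypothetical reference implementation: sum of squares, written naively."""
    total = 0
    for x in xs:
        total += x * x
    return total


def optimized(xs):
    """Hypothetical optimized rewrite that must match `original`."""
    return sum(x * x for x in xs)


def find_counterexample(reference, candidate, trials=1000, seed=0):
    """Return an input on which the two functions disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Random list of small integers, including the empty list.
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]
        if reference(xs) != candidate(xs):
            return xs
    return None
```

The property only means something while `original` and the test itself stay frozen, which is exactly what the cheating described above undermines.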
| ||||||||