Remix.run Logo
DRMacIver 5 hours ago

> But the problem remains verifying that the tests actually test what they're supposed to.

Definitely. It's a lot harder to fake this with PBT than with example-based testing, but you can still write bad property-based tests and agents are pretty good at doing so.

I have generally found that agents with property-based tests are much better at not lying to themselves about it than agents with just example-based testing, but I still spend a lot of time yelling at Claude.

> So "a huge part" - possibly, but there are other huge parts still missing.

No argument here. We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world.

sunshowers 30 minutes ago | parent | next [-]

A fun recent experience I had with Claude was I asked it to write a model for PBTs against a complex SUT, and it duplicated the SUT algorithm in the model — not helpful! I had to explicitly prompt it to write the model algorithm in a completely different style.

DRMacIver 7 minutes ago | parent [-]

Ugh, yeah. Duplicating the code under test is a bad habit that Claude has had when writing property-based tests from very early on and has never completely gone away.

Hmm now that you mention it we should add some instructions not to do that in the hegel-skill, though oddly I've not seen it doing it so far.

pron 5 hours ago | parent | prev | next [-]

> We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world.

Yeah, I know. Just an opportunity to talk about some of the delusions we're hearing from the "CEO class". Keep up the good work!

ngruhn 5 hours ago | parent | prev [-]

> I have generally found that agents with property-based tests are much better at not lying to themselves

I also observed the cheating to increase. I recently tried to do a specific optimization on a big complex function. Wrote a PBT that checks that the original function returns the same values as the optimized function on all inputs. I also tracked the runtime to confirm that performance improved. Then I let Claude loose. The PBT was great at spotting edge cases but eventually Claude always started cheating: it modified the test, it modified the original function, it implemented other (easier) optimizations, ...

DRMacIver 4 hours ago | parent [-]

Ouch. Classic Claude. It does tend to cheat when it gets stuck, and I've had some success with stricter harnesses, reflection prompts and getting it to redo work when it notices it's cheated, but it's definitely not a solved problem.

My guess is that you wouldn't have had a better time without PBT here and it would still have either cheated or claimed victory incorrectly, but definitely agreed that PBT can't fully fix the problem, especially if it's PBT that the agent is allowed to modify. I've still anecdotally found that the results are better than without it because even if agents will often cheat when problems are pointed out, they'll definitely cheat if problems aren't pointed out.