lmeyerov 3 hours ago

I've had an interesting comparative experience so far using Alloy, a 'simpler' formal methods approach that you can think of as smarter exhaustive fuzzing over your properties of interest.

We have been building GFQL, an open source graph query language that runs at the compute tier, separate from whatever existing storage system (Splunk, SQL, cs, ..). The goal is making graphs easy for interactive notebooks/dashboards/analytics/pipelines/agents without needing graph DBs, even at billion scale. Part of the trick is that it's a bit unusual inside too: we make Cypher-style queries run as columnar operations (purely vectorized in both CPU and GPU modes via a multi-target dataframe abstraction layer). That takes a few innovations in the interpreter algorithm, so we worry about subtle semantic bugs. Most of our bugs can surface with small (graph, query) combos, which sounded perfect for Alloy's small scope hypothesis!
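To make the columnar idea concrete, here is a minimal sketch (not GFQL's actual interpreter, and the names are illustrative): one "hop" of a Cypher-style pattern becomes a single dataframe join, so the whole traversal stays vectorized and the same code shape can target pandas on CPU or cudf on GPU behind a dataframe abstraction.

```python
import pandas as pd

# Toy edge list; in the real system this would come from Splunk/SQL/etc.
edges = pd.DataFrame({
    'src': ['a', 'a', 'b', 'c'],
    'dst': ['b', 'c', 'c', 'd'],
})

def hop(frontier: pd.DataFrame, edges: pd.DataFrame) -> pd.DataFrame:
    """Expand a node frontier by one forward hop, entirely via a merge
    (no per-node loops), so it vectorizes on CPU or GPU dataframes."""
    out = frontier.merge(edges, left_on='node', right_on='src')
    return out[['dst']].rename(columns={'dst': 'node'}).drop_duplicates()

frontier = pd.DataFrame({'node': ['a']})
one_hop = hop(frontier, edges)   # nodes reachable in exactly 1 hop: b, c
two_hop = hop(one_hop, edges)    # nodes reachable in exactly 2 hops: c, d
```

The subtle-bug risk lives in exactly this kind of step: frontier dedup, column schemas, and multi-hop composition all interact, and a small (graph, query) pair is usually enough to expose a mistake.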

So we are having a natural experiment for which does more bug finding:

- asking Claude code to analyze new code or recently found issues and using that for unit test amplification

- asking Claude code to use the alloy model checker to prove out our space of (query, graph) for key methods

- asking Claude code to port a large cypher conformance test suite
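For readers unfamiliar with the Alloy-style approach in the second bullet, here is the brute-force analog in plain Python (Alloy itself does this symbolically via SAT): exhaustively enumerate every directed graph up to a small bound and check that two independent implementations of a property agree. Reachability stands in here for a real GFQL semantic property; the function names are mine, not Alloy's or GFQL's.

```python
from itertools import product

def reach_bfs(edges, start):
    """Reachable set via worklist traversal."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for (a, b) in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def reach_closure(n, edges, start):
    """Reachable set via Warshall-style transitive closure."""
    reach = set(edges)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))
    return {start} | {b for (a, b) in reach if a == start}

def check_all(max_n=3):
    """Small-scope check: both implementations must agree on every
    directed graph with up to max_n nodes (no self-loops)."""
    for n in range(1, max_n + 1):
        pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
        for bits in product([0, 1], repeat=len(pairs)):
            edges = {p for p, b in zip(pairs, bits) if b}
            for start in range(n):
                assert reach_bfs(edges, start) == reach_closure(n, edges, start)
    return True
```

The small scope hypothesis is the bet that if the implementations disagree anywhere, they already disagree on one of these tiny instances.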

So far, Claude Code as a test amplifier is doing so much better than the others that it's our default preference. The real gem is when we make it do a '5 whys' root cause analysis on why some bug got through, and then it does test amplification around several categories of potential error.
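A hypothetical illustration of what that amplification looks like (the `hop` function and case names are illustrative, not GFQL's API): one escaped bug, say "a hop on an empty frontier broke the schema", becomes a family of tests covering the whole category of degenerate inputs rather than just the original repro.

```python
import pandas as pd

def hop(frontier, edges):
    out = frontier.merge(edges, left_on='node', right_on='src')
    return out[['dst']].rename(columns={'dst': 'node'}).drop_duplicates()

EDGES = pd.DataFrame({'src': ['a', 'b'], 'dst': ['b', 'a']})

# One found bug ("empty frontier"), amplified into its whole category
# of degenerate-shape inputs:
cases = {
    'empty_frontier': pd.DataFrame({'node': pd.Series([], dtype='object')}),
    'unknown_node':   pd.DataFrame({'node': ['zzz']}),
    'dup_frontier':   pd.DataFrame({'node': ['a', 'a']}),
}
for name, frontier in cases.items():
    result = hop(frontier, EDGES)
    # The invariant the original bug violated: output schema is stable
    assert list(result.columns) == ['node'], name
```

The '5 whys' step is what picks the category: instead of re-testing the literal failing input, it asks what class of inputs the code never considered, then tests across that class.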

We found the conformance suite to be 'class imbalanced': it concentrates on a couple of features we are only adding next, so the jury is still out on that one. And finally... Alloy hasn't really found anything. We suspect we need to change how we use Alloy to better reflect the kinds of bugs we actually find, but 0 was disappointing.

If anyone is into this kind of thing, happy to discuss/collab!