tobr a day ago

Interesting! My first concern is - isn’t this the ultimate non-deterministic test? In practice, does it seem flaky?

anerli a day ago | parent [-]

So the architecture is built with determinism in mind. The plan-caching system is still a work in progress, but once fully implemented it should be very consistent. As long as your interface doesn't change (or changes only in trivial ways), Moondream alone can execute the exact same web actions as previous test runs without relying on any DOM selectors. When the interface does eventually change, that's where it becomes non-deterministic again by necessity, since the planner will need to generatively update the test and continue building the new cache from there. However, once the test has been adapted, it can again be executed that way on every run until the interface changes again.

engfan 4 hours ago | parent | next [-]

Anerli wrote: “When the interface does eventually change, that's where it becomes non-deterministic again by necessity, since the planner will need to generatively update the test and continue building the new cache from there.”

But what determines that the UI has changed for a specific URL? Is that your software, independent of the planner LLM, or do you require the visual LLM to make the determination of change?

You should also stop saying 100% open source when test plan generation and execution depend on non-open source AI components. It just doesn’t make sense.

anerli 2 hours ago | parent [-]

The small VLM (Moondream) decides when the interface changes / its actions no longer line up.

We say 100% open source because all of our code (test runner and AI agents) is completely open source. It's also entirely possible to run a fully OSS stack, because you can configure it with an open source planner LLM, and Moondream is open source. You could even run it all locally if you have solid hardware.
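As a purely hypothetical illustration of that all-open-source setup (these keys, ports, and model names are not the project's actual configuration schema), both roles could be filled by locally hosted open source models:

```python
# Hypothetical configuration sketch; every key here is illustrative.
# The point: both the planner and the visual executor can be open
# source models served on your own hardware.
config = {
    "planner": {
        # Any open source LLM behind a locally hosted,
        # OpenAI-compatible endpoint.
        "base_url": "http://localhost:11434/v1",
        "model": "llama3.1",
    },
    "executor": {
        # Moondream (open source VLM) resolving actions visually,
        # also served locally.
        "base_url": "http://localhost:2020/v1",
        "model": "moondream",
    },
}
```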

daxfohl a day ago | parent | prev [-]

In a way, nondeterminism could be an advantage. Instead of using these as unit tests, use them as usability tests. Especially if you want your site to be accessible by AI agents, it would be good to have a way of knowing what tweaks increase the success rate.

Of course that would be even more valuable for testing your MCP or A2A services, but could be useful for UI as well. Or it could be useless. It would be interesting to see if the same UI changes affect both human and AI success rate in the same way.

And if not, could an AI be trained to correlate more closely with human behavior? That could be a good selling point if it's possible.
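Measuring the success rate described above could be as simple as running the nondeterministic agent repeatedly per UI variant. A minimal sketch, where `run_agent` is a stand-in for whatever browser-agent invocation you use (returning True on task completion):

```python
# Illustrative sketch: estimate an agent's task success rate on a UI,
# so two interface variants can be compared. run_agent is hypothetical.
def success_rate(run_agent, trials=20):
    """Run the (nondeterministic) agent repeatedly and report the
    fraction of runs that completed the task."""
    return sum(bool(run_agent()) for _ in range(trials)) / trials
```

Comparing `success_rate(variant_a_agent)` against `success_rate(variant_b_agent)` would then show which UI tweaks help (or hurt) the agent, which is the comparison one would also want against human usability data.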

anerli a day ago | parent [-]

Originally we were thinking about doing exactly this and building agents for usability testing. However, we think LLMs are much better suited to tackling well-defined tasks than to emulating human nuance, so we pivoted to end-to-end testing and to figuring out how to make LLM browser agents act deterministically.