NitpickLawyer | a day ago
> The idea is the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.

I've recently been thinking about testing/QA with VLMs + LLMs. One area that I haven't seen explored (but should 100% be feasible) is to have the first run be LLM + VLM, and then have the LLM(s?) write repeatable "cheap" tests with traditional libraries (Playwright, Puppeteer, etc.). On every run you do the "cheap" traditional checks; if any fail, you go with the LLM + VLM again to see what broke, and only fail the test if both fail. Makes sense?
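A minimal sketch of that two-tier flow, under stated assumptions: `Check`, `Verdict`, and `runTest` are hypothetical names for illustration, and the two checks are plain callbacks rather than real Playwright or model calls.

```typescript
// Hypothetical type: a check returns true when it passes.
// Neither callback below is a real Playwright or model API.
type Check = () => boolean;

interface Verdict {
  passed: boolean;
  usedFallback: boolean; // true when the LLM + VLM check had to run
}

// Run the cheap scripted check first; consult the expensive
// LLM + VLM check only when the cheap one fails.
function runTest(cheapCheck: Check, modelCheck: Check): Verdict {
  if (cheapCheck()) {
    return { passed: true, usedFallback: false };
  }
  // The cheap check failed, possibly because of a brittle selector,
  // so the test fails only if the model-driven check fails too.
  return { passed: modelCheck(), usedFallback: true };
}
```

The checks are synchronous here to keep the sketch short; real browser checks would almost certainly be async.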
anerli | a day ago
So this is a path that we definitely considered. However, we think it's a half-measure to generate actual Playwright code and just run that. If you do that, you still have a brittle test at the end of the day, and once it breaks you would need to pull in some LLM to try and adapt it anyway.

Instead of caching actual code, we cache a "plan" of specific web actions that are still described in natural language. For example, a cached "typing" action might look like: { variant: 'type'; target: string; content: string; } The target is a natural language description. The content is what to type. Moondream's job is simply to find the target, and then we click into that target and type the content. This means it can be full vision and not rely on the DOM at all, while still being very consistent. Moondream is also trivially cheap to run since it's only a 2B model.

If it can't find the target, or its confidence changed significantly (using token probabilities), that's an indication that the action/plan requires adjustment, and we can dynamically swap in the planner LLM to decide how to adjust the test from there.
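The replay decision described above could be sketched roughly as follows. Only the `TypeAction` shape comes from the comment; `Located`, `Locator`, `Outcome`, `replayAction`, and the `MAX_CONFIDENCE_DRIFT` threshold are assumptions made for illustration, not the project's actual API or Moondream's.

```typescript
// The cached action shape quoted in the comment above.
type TypeAction = { variant: 'type'; target: string; content: string };

// Hypothetical result of asking a vision model to locate a target.
interface Located {
  x: number;
  y: number;
  confidence: number; // assumed to be in [0, 1], e.g. from token probabilities
}

// Hypothetical locator: returns null when the target is not found.
type Locator = (target: string) => Located | null;

// Assumed tolerance; the comment does not say what counts as "significant".
const MAX_CONFIDENCE_DRIFT = 0.2;

type Outcome =
  | { kind: 'execute'; x: number; y: number; content: string }
  | { kind: 'replan'; reason: string };

// Decide whether a cached action can be replayed as-is or must go
// back to the planner.
function replayAction(
  action: TypeAction,
  cachedConfidence: number,
  locate: Locator,
): Outcome {
  const hit = locate(action.target);
  if (hit === null) {
    return { kind: 'replan', reason: 'target not found' };
  }
  if (Math.abs(hit.confidence - cachedConfidence) > MAX_CONFIDENCE_DRIFT) {
    return { kind: 'replan', reason: 'confidence changed' };
  }
  // Target found with similar confidence: click there and type the content.
  return { kind: 'execute', x: hit.x, y: hit.y, content: action.content };
}
```

Returning a tagged outcome instead of acting directly keeps the cheap path and the planner hand-off separate, so the caller decides when to bring the larger model in.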
tomatohs | 17 hours ago
This is exactly our workflow, though we defined our own YAML spec [1] for reasons mentioned in previous comments. We have multiple fallbacks to prevent flakes: the "cheap" command, a description of the intended step, and the original prompt. If any step fails, we fall back to the next source.
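One way to picture that per-step fallback order, as a sketch only: the `Step` fields, `Source`, `Attempt`, and `runStep` are hypothetical names, and the YAML spec referenced as [1] is not reproduced or implied here.

```typescript
// Hypothetical step shape holding the three sources named above.
interface Step {
  command: string;     // the "cheap" cached command
  description: string; // what the step is meant to do
  prompt: string;      // the original prompt
}

type Source = 'command' | 'description' | 'prompt';

// Hypothetical executor: returns true when the step succeeded
// using the given source.
type Attempt = (source: Source, text: string) => boolean;

// Try each source in order and return the first one that succeeds,
// or null when all three fail.
function runStep(step: Step, attempt: Attempt): Source | null {
  const order: Array<[Source, string]> = [
    ['command', step.command],
    ['description', step.description],
    ['prompt', step.prompt],
  ];
  for (const [source, text] of order) {
    if (attempt(source, text)) {
      return source;
    }
  }
  return null;
}
```

Returning which source succeeded makes it easy to log how often a run had to leave the cheap path.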