Remix.run Logo
svnt 9 hours ago

> This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything.

> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.

This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume multiples of the tokens. Could you come up with an alternative here?

Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.

palashawas 8 hours ago | parent [-]

This is a fair point.

The models frequently failed for many reasons on earlier runs, and the browser-use prompt ended up being pretty granular. I'll add a couple of runs that include a scroll instruction to the repo today and see how that compares

Pretty hard to guess what Anthropic trained sonnet on, but general multimodals are what people are using to drive similar tools today, whether GUI-trained or not, so the comparison still holds, for now