etothet 6 hours ago
Vision has a long way to go. I remember trying an early version of AWS's Nova Act and laughing at how slow it was. A few months later it didn't seem to have improved much. Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done. Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17-minute (!!!) total time for the vision agent vs. 0.5s - 2.8s for the API approach. A big part of the challenge with vision is that to manipulate the DOM, you first have to be sure the entire (current) DOM is loaded. In my experience this ends up adding a lot of artificial waits for certain elements to exist on the page.
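For anyone who hasn't written one of these waits, here is a rough sketch of the polling pattern. It isn't taken from the article or from any particular automation library: `wait_for` and `FakePage` are made-up names, and `FakePage` is only a stand-in for a real page whose element shows up after a few polls.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(probe: Callable[[], Optional[T]],
             timeout: float = 5.0,
             interval: float = 0.05) -> T:
    """Poll probe() until it returns a non-None value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page: the element exists only after N polls."""

    def __init__(self, ready_after: int) -> None:
        self.polls = 0
        self.ready_after = ready_after

    def query(self, selector: str) -> Optional[str]:
        self.polls += 1
        return f"<{selector}>" if self.polls >= self.ready_after else None


page = FakePage(ready_after=3)
element = wait_for(lambda: page.query("button"), timeout=1.0, interval=0.01)
```

Every one of those `interval` sleeps is dead time, which is part of why the totals add up the way they do. Real tools poll more cleverly, but the shape is similar.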
nijave an hour ago
Would a lightweight motion detection algorithm work there? I'm thinking of Frigate NVR, which does motion > object detection > scene description, where you build up to progressively slower and more expensive algorithms, i.e. there's motion > it's a person > here's what the person is doing.
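A minimal sketch of that kind of gating, assuming frames are small grayscale grids and using plain frame differencing as the cheap first stage. This is not how Frigate implements it; `changed_fraction`, `gated_describe`, and the thresholds are hypothetical, and `describe` stands in for whatever slow step (object detection, a vision model call) you want to avoid running on every frame.

```python
from typing import Callable, Iterable, List, Optional, Sequence

Frame = Sequence[Sequence[int]]  # grayscale pixels, 0-255


def changed_fraction(prev: Frame, curr: Frame, pixel_delta: int = 16) -> float:
    """Cheap stage: fraction of pixels whose value moved by more than pixel_delta."""
    total = 0
    changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += 1
            if abs(a - b) > pixel_delta:
                changed += 1
    return changed / total if total else 0.0


def gated_describe(frames: Iterable[Frame],
                   describe: Callable[[Frame], str],
                   min_changed: float = 0.01) -> List[Optional[str]]:
    """Run the expensive describe() only on the first frame and on frames that changed."""
    results: List[Optional[str]] = []
    prev: Optional[Frame] = None
    for frame in frames:
        if prev is None or changed_fraction(prev, frame) >= min_changed:
            results.append(describe(frame))  # expensive stage
        else:
            results.append(None)  # nothing moved, skip the slow call
        prev = frame
    return results


blank = [[0] * 4 for _ in range(4)]
moved = [row[:] for row in blank]
moved[1][2] = 255  # one pixel changes between frames

calls: List[Frame] = []


def slow_describe(frame: Frame) -> str:
    calls.append(frame)
    return "described"


results = gated_describe([blank, blank, moved, moved], slow_describe)
```

Here the slow step runs on two of the four frames. For a browser agent the equivalent would be diffing screenshots and only invoking the model once the page has stopped (or started) changing.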