marcon680 3 days ago
We fine-tune our own VLMs for this, though unfortunately we'd prefer not to share which ones specifically! ClickClickClick looks awesome. Have you heard of FerretUI (https://arxiv.org/pdf/2404.05719)? It's a pretty similar idea.
mkagenius 3 days ago
Yes, I tried a similar one called "omniparser", but it sometimes failed to annotate some of the UI elements. Moreover, Gemini and Molmo worked right out of the box without needing any fine-tuning.