julius 8 hours ago
Click coordinates. Agentic GUI use is really annoying when the multimodal agent cannot click on x,y coordinates. I tested Qwen3.6, Gemma4, and Nemotron3-nano-omni; they fully hallucinate x,y coordinates (I have not tried GLM-5V yet). GPT-5.5 can do it easily, and Vocaela, a tiny 500M model, is also quite good at it. I hope the training for x,y clicking improves soon on the smallish multimodal models. I recently slopped together an HTTP service just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser, etc.): https://github.com/julius/vocaela-click-coords-http
withinrafael 5 hours ago
I've had lots of success with generating coordinates and answering questions using the UI-TARS model: https://github.com/bytedance/UI-TARS
lopuhin 5 hours ago
Qwen3.5 is able to output click coordinates and bounding boxes just fine, as values normalized to 0..1000. I'd hope Qwen3.6 didn't lose this capability.
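To make the 0..1000 convention concrete, here is a minimal sketch of mapping that kind of output back to screen pixels. It assumes the model's values are normalized against the full screenshot it was shown, with (0, 0) at the top-left; `to_pixels` and `bbox_center` are hypothetical helper names for illustration, not part of any Qwen API.

```python
def to_pixels(nx, ny, width, height, scale=1000):
    """Map a point normalized to 0..scale onto a width x height screenshot.

    Assumption: the model normalizes against the whole image, origin top-left.
    """
    px = round(nx * width / scale)
    py = round(ny * height / scale)
    # Clamp so a value of exactly `scale` still lands on the last valid pixel.
    px = min(max(px, 0), width - 1)
    py = min(max(py, 0), height - 1)
    return px, py


def bbox_center(x1, y1, x2, y2):
    """Center of a bounding box, in the same units as its corners."""
    return (x1 + x2) / 2, (y1 + y2) / 2


if __name__ == "__main__":
    # A normalized box reported by the model, clicked at its center
    # on a 1920x1080 screenshot.
    cx, cy = bbox_center(100, 200, 300, 400)
    print(to_pixels(cx, cy, 1920, 1080))
```

If the screenshot was resized before being sent to the model, the width and height passed in should be those of the real screen, since the normalized values are resolution-independent under the assumption above.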
cyanydeez 7 hours ago
This sounds a lot like another Hacker News post from the last few days. It's the same problem image generators have with a prompt like "produce the numbers 1-50 in a spiral pattern": they can't count properly. But if you break it into raster and vector steps, where the model first produces the visual content and then an SVG overlay, it's completely capable. Have you tried a two-step approach: review the image, then render a vector?