| int0x29 an hour ago | |
> However, I would like to point out that Apple isn't totally wrong here, because the accessibility API unfortunately is way too broadly scoped, and because of that you literally get access to everything on the computer, like you can screenshot, listen, and move the cursor... This is completely ridiculous, and the proper engineering solution would actually be to phase out the accessibility API and replace it with something that is narrowly scoped, so you can grant specific permissions individually

If you don't have use of your hands, you want that. The whole point of accessibility APIs is allowing arbitrary control of your computer via novel means. One of the big selling points of Dragon NaturallySpeaking is the ability to tell your computer to do things based on descriptions, without a mouse: "open outlook", "click compose", "select subject", "type foo", etc.

Unfortunately, modern software breaks this a lot. Chrome and anything Electron-based don't provide any accessibility information to the OS. The interior of the window, excluding the tab bar, is a void. Yes, Chrome has a built-in screen reader, as do a number of Electron apps. But if you aren't blind and want to use something like Dragon, it doesn't work. Canvas-based apps are often the same.

And no, the solution here is not computer vision with an LLM. Text and buttons rendered on my computer exist in memory somewhere as text and buttons. We should not need to convert them to pixels and back, lossily, to recover text and buttons. We should just expose things to the accessibility API and not guess. | ||
| patates 14 minutes ago | |
> Chrome and anything electron based don't provide any accessibility information to the OS

Are we sure about this? At least on Windows, NVDA works fine with Chrome and any Electron apps. | ||
| Wowfunhappy 42 minutes ago | |
> And no the solution here is not computer vision with an LLM.

Also, even if you hypothetically wanted to use computer vision with an LLM… what API is that LLM going to use to take screenshots and click on stuff? | ||