stinkbeetle 6 days ago

> > I fully predict that blind people will be advocating to make actual LLM platforms accessible

> Absolutely. The LLM platforms indeed very much should be accessible. I don't think anyone would have beef with that.

The AIs I have used have fairly basic interfaces - input some text or an image, get back some text or an image - is that not something accessibility tools can already handle? Or do they mean something else by "actual LLM platform"? This isn't a rhetorical question; I don't know much about interfaces for the blind.

devinprater 6 days ago | parent | next [-]

Oh no, because screen readers are dumb things. If an app doesn't send them an announcement - through live regions on the web, or accessibility announcements on Android and iOS - they will not know that a response has been received. So the user will just sit there and tap and tap to see when a response comes in. This is especially frustrating with streaming responses, where you're not sure when streaming has completed.

Gemini for Android is awful at this when typing to it while using TalkBack: no announcements. Claude on web and Android also does nothing, and on iOS it at least places focus - accidentally, I suspect - at the beginning of the response. ChatGPT on iOS and web is great; it tells me when a response is being generated and reads it out when it's done. On iOS, it sends each line to VoiceOver as it's generated.

AI companies, and companies in general, need to understand that not all blind people talk to their devices.
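On the web side, the announcement being described here is usually done with an ARIA live region: a visually hidden status element the screen reader watches, updated once when the reply finishes. A minimal TypeScript sketch follows; the element IDs, helper names, and the hidden-status styling are illustrative assumptions, not code from ChatGPT, Claude, or Gemini.

    // Sketch only: a polite live region that announces when a streamed
    // chat response has finished, instead of announcing every token.
    const status = document.createElement("div");
    status.setAttribute("role", "status");   // role="status" implies aria-live="polite"
    status.style.position = "absolute";      // crude visually-hidden styling for the sketch
    status.style.left = "-9999px";
    document.body.appendChild(status);

    // Hypothetical container where the streamed reply is rendered.
    const responseEl = document.getElementById("chat-response")!;

    export function appendToken(token: string): void {
      // Update the visible text silently; the live region is not touched here.
      responseEl.textContent += token;
    }

    export function finishResponse(): void {
      // One announcement when streaming completes, so a screen-reader user
      // doesn't have to keep tapping around to find out the reply arrived.
      status.textContent = "Response complete.";
    }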

simonw 6 days ago | parent | next [-]

Sounds like I should reverse engineer the ChatGPT web app and see what they're doing.

agos 6 days ago | parent | prev [-]

Dang, I was hoping that with the impossibly simple interface ChatGPT has and the basically unlimited budget they have, they would have done a bit better on accessibility. Shameful.

simonw 6 days ago | parent | prev [-]

I've been having trouble figuring out how best to implement a streaming text display interface in a way that's certain to work well with screen readers.

miki123211 6 days ago | parent | next [-]

This really depends on the language.

In some languages, pronunciation(a+b) == pronunciation(a) + pronunciation(b). Polish mostly belongs to this category, for example. For these, it's enough to go token-by-token.

For English, it is not that simple, as e.g. the "uni" in "university" sounds completely different to the "uni" in "uninteresting."

In English, even going word-by-word isn't enough, as words like "read" or "live" have multiple pronunciations, and speech synthesizers rely on the surrounding context to choose which one to use. This means you probably need to go by sentence.

Then you have the problem of what to do with code, tables, headings, etc. While screen readers can announce roles as you navigate text, they cannot do so when announcing the contents of a live region, so if that's something you want, you'd need to build a micro screen reader of sorts.
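As a sketch of the sentence-level approach (TypeScript, browser): buffer the streamed tokens and only push complete sentences into an aria-live element. The element ID and the regex-based splitter are assumptions for illustration; a real implementation would want something like Intl.Segmenter with sentence granularity rather than punctuation matching.

    // Hypothetical sketch: hold tokens back until a sentence boundary, so the
    // synthesizer has enough context to pick pronunciations ("read", "live", ...).
    const live = document.getElementById("assistant-live")!; // assumed aria-live="polite" element
    let buffer = "";

    export function onToken(token: string): void {
      buffer += token;
      // Naive boundary: sentence-ending punctuation followed by whitespace.
      let match: RegExpMatchArray | null;
      while ((match = buffer.match(/^[\s\S]*?[.!?]["')\]]?\s+/)) !== null) {
        announce(match[0].trim());
        buffer = buffer.slice(match[0].length);
      }
    }

    export function onStreamEnd(): void {
      if (buffer.trim()) announce(buffer.trim()); // flush whatever is left
      buffer = "";
    }

    function announce(text: string): void {
      // Appending a node (rather than replacing textContent) lets most screen
      // readers queue each sentence instead of re-reading the whole region.
      const p = document.createElement("p");
      p.textContent = text;
      live.appendChild(p);
    }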

devinprater 6 days ago | parent | prev [-]

If it's command-line based, maybe stream by lines, or even better by sentences, rather than by received tokens.
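A rough TypeScript/Node sketch of that suggestion, assuming the tokens arrive as an async iterable (no particular SDK): hold output back until a full line is available, and flush the remainder at the end.

    // Sketch: write only complete lines to stdout so a terminal screen reader
    // speaks whole lines rather than fragments of words.
    async function printByLine(tokens: AsyncIterable<string>): Promise<void> {
      let pending = "";
      for await (const token of tokens) {
        pending += token;
        let newline: number;
        while ((newline = pending.indexOf("\n")) !== -1) {
          process.stdout.write(pending.slice(0, newline + 1));
          pending = pending.slice(newline + 1);
        }
      }
      if (pending) process.stdout.write(pending + "\n"); // flush the last partial line
    }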