krisoft 6 days ago

> The blind and visually impaired people advocating for this have been conditioned to believe that technology will solve all accessibility problems because, simply put, humans won’t do it.

Technology is not just sprouting out of the ground on its own. It is humans who are making it. Therefore, if technology is helpful, it was humans who helped.

> Let’s not mention the fact the particular large language model, LLM called Chat GPT they chose, was never the right kind of machine learning for the task of describing images.

Weird. I would think LLMs are exactly the right kind of tool to describe images. Sadly there is no more detail about what they think would be a better approach.

> I fully predict that blind people will be advocating to make actual LLM platforms accessible

Absolutely. The LLM platforms indeed very much should be accessible. I don't think anyone would have beef with that.

> I also predict web accessibility will actually get worse, not better, as coding models will spit out inaccessible code that developers won’t check or won’t even care to check.

Who knows. Either that, or some pages will become more accessible because making them accessible will take less effort on the devs' part. It will probably be a mixed bag, with a little bit of column A and a little bit of column B.

> Now that AI is a thing now, I doubt OCR and even self-driving cars will get any significant advancements.

These are all AI. They are all improving by leaps and bounds.

> An LLM will always be there, well, until the servers go down

Of course. That is a concern. This is why models you can run yourself are so important. Local models are good for latency and reliability. But even if the model runs on a remote server, as long as you control that server you decide when it gets shut down.

lxgr 6 days ago | parent | next [-]

> > Let’s not mention the fact the particular large language model, LLM called Chat GPT they chose, was never the right kind of machine learning for the task of describing images.

> Weird. I would think LLMs are exactly the right kind of tool to describe images.

TFA is from 2023, when multimodal LLMs were just picking up. I do agree that that prediction (that capability would stay flat) has aged poorly.

> I doubt OCR and even self-driving cars will get any significant advancements.

This particular prediction has also aged quite poorly. Mistral OCR, an OCR-focused LLM, is working phenomenally well in my experience compared to "non-LLM OCRs".

stinkbeetle 6 days ago | parent | prev | next [-]

> > I fully predict that blind people will be advocating to make actual LLM platforms accessible

> Absolutely. The LLM platforms indeed very much should be accessible. I don't think anyone would have beef with that.

AIs I have used have fairly basic interfaces - input some text or an image and get back some text or an image - is that not something that accessibility tools can already do? Or do they mean something else by "actual LLM platform"? This isn't a rhetorical question; I don't know much about interfaces for the blind.

devinprater 6 days ago | parent | next [-]

Oh no, because screen readers are dumb things. If you don't send them an announcement, through live regions or accessibility announcements on Android or iOS, they will not know that a response has been received. So the user will just sit there and have to tap and tap to see when a response comes in. This is especially frustrating with streaming responses, where you're not sure when streaming has completed. Gemini for Android is awful at this when typing to it while using TalkBack: no announcements. Claude on web and Android also does nothing, and on iOS it at least places focus, accidentally I suspect, at the beginning of the response. ChatGPT on iOS and web is great; it tells me when a response is being generated and reads it out when it's done. On iOS, it sends each line to VoiceOver as it's being generated. AI companies, and companies in general, need to understand that not all blind people talk to their devices.
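
On the web, the basic fix is a single polite live region that gets a short announcement when generation starts and again when it finishes. A rough sketch of what I mean, in TypeScript against the plain DOM (the element id, class name, and wording are made up, not any particular app's markup):

    // Announce response lifecycle events to screen readers via an ARIA live region.
    // "sr-announcer" and "visually-hidden" are placeholder names.
    function getAnnouncer(): HTMLElement {
      let el = document.getElementById("sr-announcer");
      if (!el) {
        el = document.createElement("div");
        el.id = "sr-announcer";
        el.setAttribute("aria-live", "polite"); // speak updates when the user is idle
        el.setAttribute("aria-atomic", "true"); // read the whole message, not just the diff
        el.className = "visually-hidden";       // assumed CSS: off-screen, but not display:none
        document.body.appendChild(el);
      }
      return el;
    }

    export function announce(message: string): void {
      const el = getAnnouncer();
      el.textContent = "";                      // clear first so repeated messages re-announce
      window.setTimeout(() => { el.textContent = message; }, 50);
    }

    // Usage around a streaming response:
    //   announce("Generating response…");
    //   ...stream tokens into the visible transcript...
    //   announce("Response finished.");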

simonw 6 days ago | parent | next [-]

Sounds like I should reverse engineer the ChatGPT web app and see what they're doing.

agos 6 days ago | parent | prev [-]

dang, I was hoping that with the impossibly simple interface chatGPT has and the basically unlimited budget they have, they would have done a bit better for accessibility. shameful

simonw 6 days ago | parent | prev [-]

I've been having trouble figuring out how best to implement a streaming text display interface in a way that's certain to work well with screenreaders.

miki123211 6 days ago | parent | next [-]

This really depends on the language.

In some languages, pronunciation(a+b) == pronunciation(a) + pronunciation(b). Polish mostly belongs to this category, for example. For these, it's enough to go token-by-token.

For English, it is not that simple, as e.g. the "uni" in "university" sounds completely different to the "uni" in "uninteresting."

In English, even going word-by-word isn't enough, as words like "read" or "live" have multiple pronunciations, and speech synthesizers rely on the surrounding context to choose which one to use. This means you probably need to go by sentence.

Then you have the problem of what to do with code, tables, headings, etc. While screen readers can announce roles as you navigate text, they cannot do so when announcing the contents of a live region, so if that's something you want, you'd need to build a micro screen reader of sorts.
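
If you do go sentence-by-sentence, the shape is roughly a buffer that only flushes at a boundary, something like the sketch below; the class name, the speak callback, and the boundary regex are all just illustrative, and real sentence segmentation is harder than this:

    // Buffer streamed tokens and only hand whole sentences to the live region /
    // speech callback, so the synthesizer has enough context to pick the right
    // pronunciation of words like "read" or "live".
    const SENTENCE_END = /[.!?]["')\]]?\s*$/; // naive boundary check; breaks on "e.g." etc.

    export class SentenceStreamer {
      private buffer = "";

      constructor(private speak: (sentence: string) => void) {}

      push(token: string): void {
        this.buffer += token;
        if (SENTENCE_END.test(this.buffer)) {
          this.speak(this.buffer.trim());
          this.buffer = "";
        }
      }

      // Call when the model signals end of stream so the trailing fragment is spoken.
      flush(): void {
        if (this.buffer.trim().length > 0) {
          this.speak(this.buffer.trim());
          this.buffer = "";
        }
      }
    }

    // Usage: const streamer = new SentenceStreamer(announce);
    //        for each streamed token: streamer.push(token);
    //        when the stream completes: streamer.flush();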

devinprater 6 days ago | parent | prev [-]

If it's command-line based, maybe stream based on lines or, even better, sentences rather than received tokens.

NoahZuniga 6 days ago | parent | prev | next [-]

Gemini 2.5 has the best vision understanding of any model I've worked with. Leagues beyond gpt5/o4.

IanCal 6 days ago | parent | next [-]

It's hard to overstate this. They perform segmentation and masking and feed information from that to the model, and it helps enormously.

Image understanding is still drastically behind text performance, with glaring mistakes that are hard to understand, but the Gemini 2.5 models are far and away the best of what I've tried.

pineaux 6 days ago | parent | prev | next [-]

Yeah, I made a small app to sell my father's books. I scanned all the books by taking pictures of the bookshelves + books (a collection of 15k books, almost all non-fiction), then fed them to different AIs. Combining Mistral OCR and Gemini worked very, very well. I ran all the pictures past both AIs and compared the output per book, then saved all the output into a SQL database for later reference. I did some other stuff with it, then made a document out of the output and sent it to a large group of book buyers, asking them to bid on individual books and on the whole collection.
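
The per-photo loop was roughly the shape below. To be clear, this is just a sketch: ocrWithMistral and describeWithGemini are placeholder stubs rather than the real client APIs, and better-sqlite3 stands in for whatever database layer you prefer.

    // Each shelf photo goes to both models, the per-book outputs get compared,
    // and everything lands in SQLite for later reference.
    import Database from "better-sqlite3";

    // Placeholder stubs; the real versions would call the Mistral OCR and Gemini APIs
    // and return one title string per book spine found in the photo.
    async function ocrWithMistral(imagePath: string): Promise<string[]> {
      throw new Error("not implemented: " + imagePath);
    }
    async function describeWithGemini(imagePath: string): Promise<string[]> {
      throw new Error("not implemented: " + imagePath);
    }

    const db = new Database("books.db");
    db.exec(`CREATE TABLE IF NOT EXISTS books (
      image TEXT, mistral_title TEXT, gemini_title TEXT, agreed INTEGER
    )`);

    async function processShelfPhoto(imagePath: string): Promise<void> {
      const [mistralTitles, geminiTitles] = await Promise.all([
        ocrWithMistral(imagePath),
        describeWithGemini(imagePath),
      ]);

      const insert = db.prepare(
        "INSERT INTO books (image, mistral_title, gemini_title, agreed) VALUES (?, ?, ?, ?)"
      );
      // Naive positional alignment; rows where the two models disagree get flagged
      // so they can be checked by hand.
      const n = Math.max(mistralTitles.length, geminiTitles.length);
      for (let i = 0; i < n; i++) {
        const m = mistralTitles[i] ?? "";
        const g = geminiTitles[i] ?? "";
        insert.run(imagePath, m, g, m.toLowerCase() === g.toLowerCase() ? 1 : 0);
      }
    }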

devinprater 6 days ago | parent | prev | next [-]

There's a whole tool based on having Gemini 2.5 describe YouTube videos: OmniDescriber.

https://audioses.com/en/yazilimlar.php

johnfn 6 days ago | parent | prev [-]

Interesting -- what sort of things do you use it for?

devinprater 6 days ago | parent [-]

Having YouTube videos described to me, basically. Since Google won't do it.

jibal 6 days ago | parent | prev | next [-]

> > Let’s not mention the fact the ==> particular <== large language model, LLM called ==> Chat GPT <== they chose, was never the right kind of machine learning for the task of describing images.

> Weird. I would think LLMs are exactly the right kind of tool to describe images.

giancarlostoro 6 days ago | parent | prev [-]

> Weird. I would think LLMs are exactly the right kind of tool to describe images. Sadly there is no more detail about what they think would be a better approach.

Not sure, but I've experimented with the Grok avatars (or characters, whatever). I hate the defaults xAI made, because they're not a generic, simple AI robot or whatever; but after you tell them to stop flirting and calling you babe (seriously, what the heck, lol), they can really hold a conversation. I talked to one about a musician I like, in a very niche genre of music, and it was able to suggest an insanely relatable song from a different artist I did not know, all in real time.

I think it was last year or the year before? They did a demo with two phones, one that could see and one that couldn't, and the two ChatGPT instances talked to each other, with one describing the room to the other. I think we are probably at the point now where you can have a room described to you.