andai 3 months ago

Didn't YouTube have auto-captions at the time this was discussed? Yeah, they're a bit dodgy, but I often watch videos in public with the sound muted, and 90% of the time you can guess from context what word was meant. (And indeed, more recent models do way, way, way better on accuracy.)

zehaeva 3 months ago | parent | next [-]

I have a few Deaf/Hard of Hearing friends who find the auto-captions to be basically useless.

Anything that's even remotely domain-specific becomes a garbled mess. Even documentaries on light engineering/archeology/history subjects are hilariously bad: names of historical places and people are only randomly correct and almost never consistent.

The second anyone has a bit of an accent, the captions become completely useless.

I keep them on partly because I'm of the "everything needs subtitles or I can't make out the words" cohort, so I can figure out what they really mean. But if you couldn't hear anything, I can see it being hugely distracting/distressing/confusing/frustrating.

hunter2_ 3 months ago | parent | next [-]

With this context, it seems as though correction-by-LLM might be a net win for your Deaf/HoH friends even if it would be a net loss for you: you can correct errors on the fly better than an LLM probably would, while the opposite is more often true for them, due to differences in experience with phonetics?

Soundex [0] is a prevailing method of codifying phonetic similarity, but unfortunately it's focused on names exclusively. Any correction-by-LLM really ought to generate substitution probabilities weighted heavily on something like that, I would think.

[0] https://en.wikipedia.org/wiki/Soundex
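For reference, here's a minimal sketch of classic American Soundex (the standard textbook rules, not anything Google-specific):

    def soundex(word: str) -> str:
        """Classic American Soundex: first letter plus three digits."""
        codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}
        word = word.upper()
        first = word[0]
        prev = codes.get(first, "")
        digits = []
        for ch in word[1:]:
            if ch in "HW":            # H and W are transparent
                continue
            code = codes.get(ch)      # vowels return None and reset prev
            if code and code != prev:
                digits.append(code)
            prev = code or ""
        return (first + "".join(digits) + "000")[:4]

    assert soundex("Robert") == soundex("Rupert") == "R163"

Two names that sound alike map to the same four-character code, which is what a substitution model would key on.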

novok 3 months ago | parent | next [-]

You can also download just the audio with yt-dlp and then regenerate the subs with Whisper or whatever other model you want. GPU-compute-wise, it will probably cost less than asking an LLM to correct a garbled transcript.
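A minimal sketch of that pipeline, assuming the yt-dlp CLI and the openai-whisper package are installed (the URL and model size here are placeholders):

    import subprocess
    import whisper

    url = "https://www.youtube.com/watch?v=VIDEO_ID"  # hypothetical video
    # -x extracts audio only; no video stream is downloaded.
    subprocess.run(["yt-dlp", "-x", "--audio-format", "mp3",
                    "-o", "audio.%(ext)s", url], check=True)

    model = whisper.load_model("small")   # pick a size that fits your GPU
    result = model.transcribe("audio.mp3")
    print(result["text"])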

ldenoue 3 months ago | parent | next [-]

The current Flash-8B model I use costs $1 per 500 hours of transcript.

andai 3 months ago | parent [-]

If I read OpenAI's pricing right, then Google's thing is 200 times cheaper?
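Rough check, if OpenAI's Whisper API is still $0.006 per minute: that's $0.36 per hour of audio, while $1 per 500 hours works out to $0.002 per hour, so the ratio is 0.36 / 0.002 = 180x, i.e. roughly 200x.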

HPsquared 3 months ago | parent | prev [-]

I suppose the gold standard would be a multimodal model that also looks at the screen (maybe only if the captions aren't making much sense).

schrodinger 3 months ago | parent | prev | next [-]

I'd assume Soundex is too basic and English-centric to be a practical solution for an international company like Google. I was taught it and implemented it in a freshman-level CS course in 2004; it can't be anywhere near state of the art!

shakna 3 months ago | parent | prev [-]

Soundex is fast but inaccurate. It prevails only because of the computational cost of alternatives like Levenshtein distance.
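For contrast, a minimal Levenshtein sketch: it needs a full dynamic-programming pass per word pair, O(len(a) * len(b)), where Soundex is a single O(n) pass per word:

    def levenshtein(a: str, b: str) -> int:
        """Edit distance via the classic DP over two rolling rows."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    assert levenshtein("kitten", "sitting") == 3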

creato 3 months ago | parent | prev | next [-]

I use YouTube closed captions all the time when I don't want audio. The captions are almost always fine, and I'm definitely not watching videos that would have had professional, human-edited captions either.

There may be mistakes like the ones you mentioned (getting names wrong/inconsistent), but if I know what was intended, it's pretty easy to ignore that. I think expecting "textual" correctness is unreasonable. Usually when there are mistakes, they are "phonetic", i.e. if you spoke the caption out loud, it would sound pretty similar to what was spoken in the video.

dqv 3 months ago | parent | next [-]

> I think expecting "textual" correctness is unreasonable.

Of course you think that: you don't have to rely solely on closed captions! It's usually not even posed as an expectation, but as a request to correct captions that don't make sense. Especially now that we have auto-captioning and tools that auto-correct the captions, running through and tweaking them to near-perfect accuracy is not an undue burden.

> if you spoke the caption out loud, it would sound pretty similar to what was spoken in the video.

Yes, but most deaf people can't do that. Even if they can, they shouldn't have to.

beeboobaa6 3 months ago | parent [-]

There's helping people and there's infantilizing them. Being deaf doesn't mean you're stupid. They can figure it out.

Deleting thousands of hours of course material because you're worried they're not able to understand autogenerated captions just ensures everyone loses. Don't be so ridiculous.

mst 3 months ago | parent | prev [-]

They continue to be the worst automated transcripts I encounter, and personally I find them sufficiently terribad that every time I try them I end up filing them under "nope, still more trouble than it's worth; gonna find a different source for this information and give them another go in six months."

Even mentally sounding them out (which is fine for me since I have no relevant disabilities, I just despise trying to take in any meaningful quantity of information from a video) when they look weird doesn't make them tolerable *for me*.

It's still a good thing overall that they're tolerable for you, though, and I hope other people are on average finding the experience closer to how you find it than how I find it ... but I definitely don't, yet.

Hopefully in a year or so I'll be in the same camp as you are, though, overall progress in the relevant class of tech seems to've hit a pretty decent velocity these days.

GaggiX 3 months ago | parent | prev | next [-]

YouTube captions have improved massively in recent years: they are flawless in most cases, with occasional errors (almost entirely when reporting numbers).

I think that the biggest problem is that the subtitles do not distinguish between the speakers.

ldenoue 3 months ago | parent | prev [-]

Definitely: just giving the LLM context before it corrects (in this case the title and description of the video, often written by a person) produces much better transcripts.
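A sketch of that context-priming idea, assuming the google-generativeai client; the model name and prompt wording are illustrative, not the author's actual pipeline:

    import google.generativeai as genai

    genai.configure(api_key="...")  # your API key
    model = genai.GenerativeModel("gemini-1.5-flash-8b")

    def correct_transcript(title: str, description: str, transcript: str) -> str:
        # Title/description give the model the domain vocabulary it needs
        # to fix misheard names and jargon without rewriting the speech.
        prompt = (f"Video title: {title}\n"
                  f"Video description: {description}\n\n"
                  "Correct transcription errors in the captions below. "
                  "Preserve wording; only fix misheard words, names, and "
                  "domain terms suggested by the context above.\n\n"
                  f"{transcript}")
        return model.generate_content(prompt).text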

jonas21 3 months ago | parent | prev | next [-]

Yes, but the DOJ determined that the auto-generated captions were "inaccurate and incomplete, making the content inaccessible to individuals with hearing disabilities." [1]

If the automatically generated captions are now of similar quality to human-generated ones, that changes things.

[1] https://news.berkeley.edu/wp-content/uploads/2016/09/2016-08...

jazzyjackson 3 months ago | parent | prev | next [-]

Definitely depends on audio quality and how closely a speaker's dialect matches the Mid-Atlantic accent, if you catch my drift.

IME YouTube transcripts are completely devoid of meaningful information, especially when domain-specific vocabulary is used.

PeterStuer 3 months ago | parent | prev | next [-]

YouTube auto-captions are extremely poor compared to, e.g., running the audio through Whisper.

cavisne 3 months ago | parent | prev [-]

What happened here is a specific scam where companies are targeted for ADA violations, which are so vague it’s impossible to “comply”.