pawelduda 20 hours ago

Can you give an example why it made your life that much better?

3036e4 16 hours ago | parent | next [-]

I used it like the sibling commenter did, to get subtitles for downloaded videos. My hearing is bad. Whisper seems much better than YouTube's built-in auto-subtitles, so sometimes it is worth the extra trouble for me to download a video just to generate good subtitles and then watch it offline.

I also used whisper.cpp to transcribe all my hoarded podcast episodes. It took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). It worked as well as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic, and sometimes now I read an episode instead of listening, or listen to it with mpv with subtitles.
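
In case it is useful to anyone, the whole loop is only a few lines. A rough sketch of what I ran (the whisper-cli binary name, model file, and flags are whatever your whisper.cpp build provides, so adjust to taste):

    # sketch: batch-transcribe a directory of podcast audio with whisper.cpp
    # assumes ffmpeg and whisper.cpp's whisper-cli are on PATH; paths are illustrative
    import subprocess
    from pathlib import Path

    MODEL = "models/ggml-base.en.bin"  # any ggml Whisper model
    PODCASTS = Path("podcasts")

    for ep in sorted(PODCASTS.glob("*.mp3")):
        wav = ep.with_suffix(".wav")
        # whisper.cpp expects 16 kHz mono PCM input
        subprocess.run(["ffmpeg", "-y", "-i", str(ep),
                        "-ar", "16000", "-ac", "1", str(wav)], check=True)
        # --output-txt writes <prefix>.txt next to the episode
        subprocess.run(["whisper-cli", "-m", MODEL, "-f", str(wav),
                        "--output-txt", "--output-file", str(ep.with_suffix(""))],
                       check=True)
        wav.unlink()  # keep only the transcript

After that, rg -i 'some topic' podcasts/ finds the right episodes instantly.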

theshrike79 an hour ago | parent | next [-]

This, but I want a summary of the 3-hour video first, before spending the time on it.

Download -> generate subtitles -> feed to AI for summary works pretty well
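
The summary step is the only part that needs code. A minimal sketch, assuming whisper.cpp (or similar) has already produced talk.txt and that you use the openai package; the model name is just an example:

    # sketch: feed a generated transcript to an LLM for a summary
    # assumes talk.txt exists and OPENAI_API_KEY is set in the environment
    from openai import OpenAI

    transcript = open("talk.txt").read()

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # anything with a context window big enough for 3 hours of talk
        messages=[{
            "role": "user",
            "content": "Summarize this 3-hour video in ten bullet points "
                       "so I can decide whether to watch it:\n\n" + transcript,
        }],
    )
    print(resp.choices[0].message.content)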

peterleiser 10 hours ago | parent | prev [-]

You'll probably like WhisperLive and its browser extensions: https://github.com/collabora/WhisperLive?tab=readme-ov-file#...

Start playing a YouTube video in the browser, select "start capture" in the extension, and it starts writing subtitles in white text on a black background below the video. When you stop capturing you can download the subtitles as a standard .srt file.
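
There is also a Python client in the repo if you'd rather script it than click through the extension. From my reading of the README it is roughly this, though treat the argument names as assumptions since they change between versions:

    # sketch: WhisperLive's Python client, per my reading of the repo's README
    # a WhisperLive server must already be running on localhost:9090
    from whisper_live.client import TranscriptionClient

    client = TranscriptionClient(
        "localhost", 9090,  # host and port of the running server
        lang="en",
        model="small",
    )
    client("some_audio.wav")  # pass a file, or call with no argument to use the microphone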

kmfrk 20 hours ago | parent | prev | next [-]

Aside from the accessibility benefits already mentioned, you can catch up on hours-long videos far faster than by watching at 3-4x playback speed. If you catch up through something like Subtitle Edit, you can also click on relevant parts of the transcript and replay them.

But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less straining when there are captions.

Obviously, one important factor in the convenience is how fast your computer is at transcription and translation. I don't currently use these features in real time myself, although I'd like to if software with a great UX comes along.

There's also a great podcast app opportunity here I hope someone seizes.

shrx 20 hours ago | parent | prev | next [-]

As a hard-of-hearing person, I can now download any video from the internet (e.g. YouTube) and generate subtitles on the fly, without having to struggle to understand badly recorded or unintelligible speech.

dylan604 20 hours ago | parent | next [-]

If the dialog is badly recorded or the speech is unintelligible, how would a transcription process get it right?

gregoryl 19 hours ago | parent | next [-]

Because it can use the full set of information in the audio, which people with hearing difficulties cannot. Also interesting: people with perfectly functional hearing but with "software" bugs (e.g. I find it extremely hard to process voices over significant background noise) can also benefit :)

spauldo 16 hours ago | parent [-]

I have that issue as well - I can hear faint noises OK but if there's background noise I can't understand what people say. But I'm pretty sure there's a physical issue at the root of it in my case. The problem showed up after several practice sessions with a band whose guitarist insisted on always playing at full volume.

gregoryl 10 hours ago | parent | next [-]

I'd love your thoughts on why it might be hardware. My reasoning is that my hearing itself is generally fine: I have no issue picking apart loud, complex music (I love breakcore!).

But play two songs at the same time, or try talking to me with significant background noise, and I seem to be distinctly impaired vs. most others.

If I concentrate, I can sometimes work through it.

My uninformed model is a pipeline of sorts, where some pre-processing stage isn't switched on, so everything downstream has a much harder job.

spauldo 5 hours ago | parent [-]

I don't have much beyond what I said. It happened to me after repeated exposure to dangerously loud sounds in a small room. I can hear faint sounds, but I have trouble with strong accents and I can't understand words if there's a lot of background noise. I noticed it shortly after I left that band, and I left because the last practice was so loud it felt like a drill boring into my ears.

I don't think I have any harder time appreciating complex music than I did before, but I'm more of a 60s-70s rock kinda guy and a former bass player, so I tend to focus more on the low end. Bass tends to be less complex because you can't fit as much signal into the waveform without getting unpleasant muddling.

And of course, just because we have similar symptoms doesn't mean the underlying causes are the same. My grandfather was hard of hearing so for all I know it's genetic and the timing was a coincidence. Who knows?

dylan604 11 hours ago | parent | prev [-]

> I have that issue as well

You say issue, I say feature. It's a great way to ignore boring babbling at parties or other social engagements where you're just not that engaged. Sort of like selective hearing in relationships, but applied to a wider audience.

enneff 11 hours ago | parent | next [-]

I don’t mean to speak for OP, but it strikes me as rude to make light of someone’s disability in this way. I’d guess it has caused them a lot of frustration.

dylan604 9 hours ago | parent [-]

Your assumption leads you to believe that I do not also suffer from the same issue. Ever since I was in a t-bone accident and the side airbag went off right next to my head, I have a definite issue hearing voices in crowded and noisy rooms with poor sound insulation. Some rooms are much worse than others.

So when I say I call it a feature, it's something I actually deal with unlike your uncharitable assumption.

jhy 2 hours ago | parent [-]

Sometimes, late at night when I'm trying to sleep, and I hear the grumble of a Harley, or my neighbors staggering to their door, I wonder: why do we not have earflaps, like we do eyelids?

spauldo 10 hours ago | parent | prev [-]

It's not so great when I'm standing right next to my technician in a pumphouse and I can't understand what he's trying to say to me.

mschuster91 19 hours ago | parent | prev [-]

The definition of "unintelligible" varies by person, and especially by accent. I have no problem understanding the average person from Germany... but someone from the deep backwaters of Saxony? Forget about it.

3036e4 17 hours ago | parent | prev [-]

I did this as recently as today, for that reason, using ffmpeg and whisper.cpp. Not on the fly, though: I ran it on a few videos to generate VTT files.
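
Concretely it is two commands per video; in Python it looks something like this (a sketch; the whisper-cli binary and flag names depend on your whisper.cpp build):

    # sketch: one downloaded video -> .vtt subtitles via ffmpeg + whisper.cpp
    import subprocess

    # whisper.cpp wants 16 kHz mono WAV input
    subprocess.run(["ffmpeg", "-y", "-i", "video.mp4",
                    "-ar", "16000", "-ac", "1", "audio.wav"], check=True)
    # --output-vtt writes video.vtt next to the given output prefix
    subprocess.run(["whisper-cli", "-m", "models/ggml-base.bin", "-f", "audio.wav",
                    "--output-vtt", "--output-file", "video"], check=True)

Then mpv video.mp4 --sub-file=video.vtt plays it with the generated subtitles.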

joshvm 14 hours ago | parent | prev [-]

I don't know about much better, but I like Whisper's ability to subtitle foreign language content on YouTube that (somehow) doesn't have auto-generated subs. For example some relatively obscure comedy sketches from Germany where I'm not quite fluent enough to go by ear.

Ten years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video you had. The same goes for older lecture videos that don't have transcripts: many courses had to provide them to comply with federal funding rules, but not all, and lots of international courses have no such requirement at all (for example, some great introductory CS/maths courses from German and Swiss institutions). Also think about taking this auto-generated output and then producing summaries for lecture notes or reading recommendations; this sort of thing is what LLMs are great at.

You can do some clever things, like take the foreign sub, have Whisper also transcribe it, and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic differences between languages. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled by a cheaper model. And you can even query the model to ask why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.
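
The review step, as a sketch with the google-generativeai package (model name, file names, and prompt are just what I'd start with; in practice I drive this through Cursor rather than a script):

    # sketch: ask Gemini to review a Whisper translation line by line
    # assumes GOOGLE_API_KEY is set; file names are illustrative
    import os
    import google.generativeai as genai

    original = open("sketch.de.srt").read()    # the foreign-language subtitle
    translated = open("sketch.en.srt").read()  # Whisper's English output

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")  # any big model works here
    prompt = (
        "Go line by line through this German subtitle file and its English "
        "translation. Flag and correct mis-heard words, and note idioms that "
        "were translated too literally.\n\nGERMAN:\n" + original +
        "\n\nENGLISH:\n" + translated
    )
    print(model.generate_content(prompt).text)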

Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.