kmfrk a day ago

Whisper is genuinely amazing, with the right nudging. It's the one AI thing that has turned my life upside down in an unambiguously good way.

People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.

HOWTO:

Drop a video or audio file onto the right-hand window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is built for cleaning up imperfect transcripts, with features like Tools > Fix common errors.

EDIT: Oh, and if you're on the current generation of Nvidia cards, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error complains about an empty file or output, something like that.

EDIT2: And if you get another error, possibly about whisper.exe, iirc I had to reinstall the Torch libs from a specific index with something along these lines (depending on whether you use pip or uv):

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

    uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
If you hit these errors and the above fixes work, please post your error message in a reply along with what fixed it, to help those who come after. Or at least the web crawlers, for anyone searching for help.

https://www.nikse.dk/subtitleedit

https://www.nikse.dk/donate

https://github.com/SubtitleEdit/subtitleedit/releases

notatallshaw 20 hours ago | parent | next [-]

> uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

uv has a feature to pick the correct version of torch based on your available CUDA (and some non-CUDA) drivers (though I suggest using a venv, not the system Python):

> uv pip install torch torchvision torchaudio --torch-backend=auto

More details: https://docs.astral.sh/uv/guides/integration/pytorch/#automa...

This also means you can safely mix torch requirements with non-torch requirements, as it will only pull the torch-related packages from the torch index and everything else from PyPI.
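
For example, something like this should pull torch and torchvision from the matching PyTorch index while the other packages (just illustrative picks) come from PyPI:

    uv pip install --torch-backend=auto torch torchvision rich requests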

xrd 19 hours ago | parent [-]

I love uv and really feel like I only need to know "uv add" and "uv sync" to be effective using it with python. That's an incredible feat.

But when I hear about these kinds of extras, it makes me even more excited. Getting CUDA and torch to work together is something I have struggled with countless times.

The team at Astral should be nominated for a Nobel Peace Prize.

danudey 17 hours ago | parent | next [-]

> "uv add"

One life-changing thing I've been using `uv` for:

System python version is 3.12:

    $ python3 --version
    Python 3.12.3
A script that requires a library we don't have, and won't work on our local python:

    $ cat test.py
    #!/usr/bin/env python3

    import sys
    from rich import print

    if sys.version_info < (3, 13):
        print("This script will not work on Python 3.12")
    else:
        print(f"Hello world, this is python {sys.version}")
It fails:

    $ python3 test.py
    Traceback (most recent call last):
    File "/tmp/tmp/test.py", line 10, in <module>
        from rich import print
    ModuleNotFoundError: No module named 'rich'
Tell `uv` what our requirements are

    $ uv add --script=test.py --python '3.13' rich
    Updated `test.py`
`uv` updates the script:

    $ cat test.py
    #!/usr/bin/env python3
    # /// script
    # requires-python = ">=3.13"
    # dependencies = [
    #     "rich",
    # ]
    # ///

    import sys
    from rich import print

    if sys.version_info < (3, 13):
        print("This script will not work on Python 3.12")
    else:
        print(f"Hello world, this is python {sys.version}")
`uv` runs the script after installing packages and fetching Python 3.13:

    $ uv run test.py
    Downloading cpython-3.13.5-linux-x86_64-gnu (download) (33.8MiB)
    Downloading cpython-3.13.5-linux-x86_64-gnu (download)
    Installed 4 packages in 7ms
    Hello world, this is python 3.13.5 (main, Jun 12 2025, 12:40:22) [Clang 20.1.4 ]
And if we run it with Python 3.12, we get a warning and the script's fallback message:

    $ uv run --python 3.12 test.py
    warning: The requested interpreter resolved to Python 3.12.3, which is incompatible with the script's Python requirement: `>=3.13`
    Installed 4 packages in 7ms
    This script will not work on Python 3.12
Works for any Python you're likely to want:

    $ uv python list
    cpython-3.14.0b2-linux-x86_64-gnu                 <download available>
    cpython-3.14.0b2+freethreaded-linux-x86_64-gnu    <download available>
    cpython-3.13.5-linux-x86_64-gnu                   /home/dan/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/bin/python3.13
    cpython-3.13.5+freethreaded-linux-x86_64-gnu      <download available>
    cpython-3.12.11-linux-x86_64-gnu                  <download available>
    cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3.12
    cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3 -> python3.12
    cpython-3.11.13-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/bin/python3.11
    cpython-3.10.18-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/bin/python3.10
    cpython-3.9.23-linux-x86_64-gnu                   <download available>
    cpython-3.8.20-linux-x86_64-gnu                   <download available>
    pypy-3.11.11-linux-x86_64-gnu                     <download available>
    pypy-3.10.16-linux-x86_64-gnu                     <download available>
    pypy-3.9.19-linux-x86_64-gnu                      <download available>
    pypy-3.8.16-linux-x86_64-gnu                      <download available>
    graalpy-3.11.0-linux-x86_64-gnu                   <download available>
    graalpy-3.10.0-linux-x86_64-gnu                   <download available>
    graalpy-3.8.5-linux-x86_64-gnu                    <download available>
eigenvalue 18 hours ago | parent | prev | next [-]

They’ve definitely saved me many hours of wasted time between uv and ruff.

j45 9 hours ago | parent | prev [-]

Agreed. Making virtual environment management and so much else disappear lets so much more focus go to Python itself.

tossit444 21 hours ago | parent | prev | next [-]

Aegisub is still actively developed (as a fork), and imo the two can't really be compared to one another. They complement each other, but SE is much better for actual transcription, while Aegisub still does the heavy lifting for typesetting and the like.

pawelduda 20 hours ago | parent | prev | next [-]

Can you give an example why it made your life that much better?

3036e4 16 hours ago | parent | next [-]

I used it like the sibling commenter, to get subtitles for downloaded videos. My hearing is bad. Whisper seems much better than YouTube's built-in auto-subtitles, so sometimes it is worth the extra trouble for me to download a video just to generate good subtitles and then watch it offline.

I also used whisper.cpp to transcribe all my hoarded podcast episodes. Took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). Worked as well as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic, and sometimes now I read an episode instead of listening, or listen to it with mpv with subtitles.
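
For anyone curious, the batch flow is roughly this (binary name, model file and paths are my assumptions; older whisper.cpp builds call the binary main rather than whisper-cli):

    # convert each episode to 16 kHz mono WAV, transcribe to a .txt next to it
    for ep in podcasts/*.mp3; do
        wav="${ep%.mp3}.wav"
        ffmpeg -i "$ep" -ar 16000 -ac 1 "$wav"
        ./whisper-cli -m models/ggml-large-v2.bin -f "$wav" -otxt -of "${ep%.mp3}"
    done
    # later, find episodes that mention a topic
    rg -il "some topic" podcasts/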

theshrike79 an hour ago | parent | next [-]

This, but I want a summary of the 3-hour video before spending the time on it.

Download -> generate subtitles -> feed to AI for summary works pretty well
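
A minimal sketch of that pipeline (assumes yt-dlp and the openai-whisper CLI are installed; the URL is a placeholder):

    yt-dlp -x --audio-format mp3 -o "talk.%(ext)s" "https://example.com/some-video"
    whisper talk.mp3 --model large-v2 --output_format txt
    # then paste talk.txt into whatever LLM you use for the summary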

peterleiser 10 hours ago | parent | prev [-]

You'll probably like Whisper Live and its browser extensions: https://github.com/collabora/WhisperLive?tab=readme-ov-file#...

Start playing a YouTube video in the browser, select "start capture" in the extension, and it starts writing subtitles in white text on a black background below the video. When you stop capturing you can download the subtitles as a standard .srt file.

kmfrk 20 hours ago | parent | prev | next [-]

Aside from accessibility as mentioned, you can catch up on videos that are hours long. Orders of magnitude faster than watching at 3-4x playback speed. If you catch up through something like Subtitle Edit, you can also click on relevant parts of the transcript and replay them.

But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less straining when there are captions for it.

Obviously one important factor in the convenience is how fast your computer is at transcription or translation. I don't currently use the features in real time myself, although I'd like to if a great UX comes along in other software.

There's also a great podcast app opportunity here I hope someone seizes.

shrx 20 hours ago | parent | prev | next [-]

As a hard of hearing person, I can now download any video from the internet (e.g. youtube) and generate subtitles on the fly, not having to struggle to understand badly recorded or unintelligible speech.

dylan604 20 hours ago | parent | next [-]

If the dialogue is badly recorded or the speech is unintelligible, how would a transcription process get it right?

gregoryl 19 hours ago | parent | next [-]

Because it can use the full set of information in the audio, which people with hearing difficulties cannot. Also interesting: people with perfectly functional hearing but who have "software" bugs (e.g. I find it extremely hard to process voices with significant background noise) can also benefit :)

spauldo 16 hours ago | parent [-]

I have that issue as well - I can hear faint noises OK but if there's background noise I can't understand what people say. But I'm pretty sure there's a physical issue at the root of it in my case. The problem showed up after several practice sessions with a band whose guitarist insisted on always playing at full volume.

gregoryl 10 hours ago | parent | next [-]

I'd love your thoughts on why it might be hardware. I reason that my hearing is generally fine - there's no issue picking apart loud complex music (I love breakcore!).

But play two songs at the same time, or try talking to me with significant background noise, and I seem to be distinctly impaired vs. most others.

If I concentrate, I can sometimes work through it.

My uninformed model is a pipeline of sorts where some pre-processing stage isn't turned on, so the stuff after it has a much harder job.

spauldo 5 hours ago | parent [-]

I don't have much beyond what I said. It happened to me after repeated exposure to dangerously loud sounds in a small room. I can hear faint sounds, but I have trouble with strong accents and I can't understand words if there's a lot of background noise. I noticed it shortly after I left that band, and I left because the last practice was so loud it felt like a drill boring into my ears.

I don't think I have any harder time appreciating complex music than I did before, but I'm more of a 60s-70s rock kinda guy and a former bass player, so I tend to focus more on the low end. Bass tends to be less complex because you can't fit as much signal into the waveform without getting unpleasant muddling.

And of course, just because we have similar symptoms doesn't mean the underlying causes are the same. My grandfather was hard of hearing so for all I know it's genetic and the timing was a coincidence. Who knows?

dylan604 11 hours ago | parent | prev [-]

> I have that issue as well

You say issue, I say feature. It's a great way to just ignore boring babbling at parties or other social engagements where you're just not that engaged. Sort of like selective hearing in relationships, but used on a wider audience

enneff 11 hours ago | parent | next [-]

I don’t mean to speak for OP, but it strikes me as rude to make light of someone’s disability in this way. I’d guess it has caused them a lot of frustration.

dylan604 9 hours ago | parent [-]

Your assumption leads you to believe that I do not also suffer from the same issue. Ever since I was in a t-bone accident and the side airbag went off right next to my head, I have a definite issue hearing voices in crowded and noisy rooms with poor sound insulation. Some rooms are much worse than others.

So when I say I call it a feature, it's something I actually deal with unlike your uncharitable assumption.

jhy 2 hours ago | parent [-]

Sometimes, late at night when I'm trying to sleep, and I hear the grumble of a Harley, or my neighbors staggering to their door, I wonder: why do we not have earflaps, like we do eyelids?

spauldo 10 hours ago | parent | prev [-]

It's not so great when I'm standing right next to my technician in a pumphouse and I can't understand what he's trying to say to me.

mschuster91 19 hours ago | parent | prev [-]

The definition of "unintelligible" varies by person, especially by accent. Like, I got no problem with understanding the average person from Germany... but someone from the deep backwaters of Saxony, forget about that.

3036e4 17 hours ago | parent | prev [-]

I did this as recently as today, for that reason, using ffmpeg and whisper.cpp. But not on the fly. I ran it on a few videos to generate VTT files.
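
In case it helps anyone, the one-off version is short (the whisper-cli binary name and model file are my assumptions):

    ffmpeg -i video.mp4 -ar 16000 -ac 1 audio.wav
    ./whisper-cli -m models/ggml-large-v2.bin -f audio.wav -ovtt -of video
    mpv video.mp4 --sub-file=video.vtt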

joshvm 14 hours ago | parent | prev [-]

I don't know about much better, but I like Whisper's ability to subtitle foreign language content on YouTube that (somehow) doesn't have auto-generated subs. For example some relatively obscure comedy sketches from Germany where I'm not quite fluent enough to go by ear.

10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or older lecture videos that don't have transcripts. Many courses had to, in order to comply with federal funding, but not all. And lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German + Swiss institutions). Also think about taking this auto generated output and then generating summaries for lecture notes, reading recommendations - this sort of stuff is what LLMs are great at.

You can do some clever things like take the foreign sub, have Whisper also transcribe it and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic differences between languages. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled by a cheaper model. And you can even query the model to ask why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.

Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.

taminka 20 hours ago | parent | prev | next [-]

whisper is great, i wonder why youtube's auto-generated subs are still so bad? even the smallest whisper is way better than google's solution. is it a licensing issue? harder to deploy at scale?

briansm 17 hours ago | parent | next [-]

I believe youtube still uses 40 mel-scale vectors as feature data; whisper uses 80 (which provides finer spectral detail but is naturally more computationally intensive to process, though modern hardware allows for that).

ec109685 11 hours ago | parent | prev [-]

You’d think they’d use the better model at least for videos that have large view counts (they already do that when deciding compression optimizations).

BrunoJo 15 hours ago | parent | prev | next [-]

Subtitle Edit is great if you have the hardware to run it. If you don't have GPUs available or don't want to manage the servers, I built a simple-to-use and affordable API that you can use: https://lemonfox.ai/

codedokode 18 hours ago | parent | prev | next [-]

Kdenlive also supports auto-generating subtitles, which need some editing but are still faster than creating them from scratch. Actually I would be happy even with a simple voice detector so that I don't have to set the timings manually.

kanemcgrath 13 hours ago | parent | prev | next [-]

Subtitle Edit is great, and its subtitle library libse was exactly what I needed for a project I did.

throwoutway 19 hours ago | parent | prev | next [-]

I found this online demo of it: https://www.nikse.dk/subtitleedit/online

Morizero 15 hours ago | parent | prev | next [-]

You don't happen to know a whisper solution that combines diarization with live audio transcription, do you?

peterleiser 10 hours ago | parent | next [-]

Check out https://github.com/jhj0517/Whisper-WebUI

I ran it last night using docker and it worked extremely well. You need a HuggingFace read-only API token for the Diarization. I found that the web UI ignored the token, but worked fine when I added it to docker compose as an environment variable.

jduckles 15 hours ago | parent | prev | next [-]

WhisperX's diarization is great imo:

    whisperx input.mp3 --language en --diarize --output_format vtt --model large-v2
Works a treat for Zoom interviews. Diarization is sometimes a bit off, but generally it's correct.

Morizero 14 hours ago | parent [-]

> input.mp3

Thanks but I'm looking for live diarization.

kmfrk 15 hours ago | parent | prev [-]

Proper diarization still remains a white whale for me, unfortunately.

Last I looked into it, the main options required API access to external services, which put me off. I think it was pyannote.audio[1].

[1]: https://github.com/pyannote/pyannote-audio

peterleiser 10 hours ago | parent [-]

I used diarization in https://github.com/jhj0517/Whisper-WebUI last night and once it downloads the model from HuggingFace it runs offline (it claims).

jokethrowaway 19 hours ago | parent | prev | next [-]

whisper is definitely nice, but it's a bit too slow. Having subtitles and transcription for everything is great - but Nemo Parakeet (pretty much whisper by nvidia) completely changed how I interact with the computer.

It enables dictation that actually works, and it's as fast as you can think. I also have a set of scripts that just wait for voice commands and do things. I can pipe the results to an LLM, run commands, synthesize a voice back with F5-TTS, and it's like having a local Jarvis.

The main limitation is that it's English-only.

threecheese 18 hours ago | parent | next [-]

Would you share the scripts?

ec109685 11 hours ago | parent [-]

Or at least more details. Very cool!

forgingahead 6 hours ago | parent | prev [-]

Yeah, mind sharing any of the scripts? I looked at the docs briefly, looks like we need to install ALL of nemo to get access to Parakeet? Seems ultra heavy.

rhdunn 3 hours ago | parent [-]

You only need the ASR bits -- this is where I got to when I previously looked into running Parakeet:

    # NeMo does not run on 3.13+
    python3.12 -m venv .venv
    source .venv/bin/activate

    git clone https://github.com/NVIDIA/NeMo.git nemo
    cd nemo

    pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu128
    pip install ".[asr]"

    deactivate
Then run a transcribe.py script in that venv:

    import sys
    import nemo.collections.asr as nemo_asr

    # argv[1]: path to a local .nemo file, or an 'org/model' id on Hugging Face
    model_path = sys.argv[1]
    audio_path = sys.argv[2]

    # Load from a local path...
    asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(restore_path=model_path)

    # ...or download from huggingface ('org/model') instead:
    # asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=model_path)

    output = asr_model.transcribe([audio_path])
    print(output[0])
With that I was able to run the model, but I ran out of memory on my lower-spec laptop. I haven't yet got around to running it on my workstation.

You'll need to modify the python script to process the response and output it in a format you can use.

hart_russell 17 hours ago | parent | prev | next [-]

Is there a way to use it to generate an SRT subtitle file from a video file?

prurigro 16 hours ago | parent [-]

It generates a few formats by default, including SRT.
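
With the plain openai-whisper CLI, for example (it hands the file to ffmpeg for decoding, so video input works as long as ffmpeg is installed):

    whisper video.mp4 --model large-v2 --output_format srt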

guluarte 16 hours ago | parent | prev | next [-]

You can install it using winget or Chocolatey:

    winget install --id=Nikse.SubtitleEdit  -e
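
If you prefer Chocolatey, I believe the package id is simply subtitleedit (worth double-checking):

    choco install subtitleedit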