Remix.run Logo
rahimnathwani 7 hours ago

For this use case, why not use Whisper to transcribe the audio, and then an LLM to do a second step (summarization or answering questions or whatever)?

If you need diarization, you can use something like https://github.com/m-bain/whisperX

pants2 7 hours ago | parent | next [-]

Whisper simply isn't very good compared to LLM audio transcription like gpt-4o-transcribe. If Gemini 3 is even better it's a game-changer.

crazysim 7 hours ago | parent | prev [-]

Since Gemini seems to be sucking at timestamps, perhaps Whisper can be used to help ground that as an additional input alongside the audio.