| ▲ | rahimnathwani 7 hours ago | |
For this use case, why not use Whisper to transcribe the audio, and then an LLM to do a second step (summarization or answering questions or whatever)? If you need diarization, you can use something like https://github.com/m-bain/whisperX | ||
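A minimal sketch of the two-step pipeline suggested above, assuming the transcription and LLM steps are passed in as plain callables. The names `transcribe_then_ask`, `transcribe`, and `ask_llm` are hypothetical and not from any library; in practice `transcribe` might wrap Whisper or whisperX, and `ask_llm` might wrap whatever chat client you use.

```python
from typing import Callable


def transcribe_then_ask(
    audio_path: str,
    task: str,
    transcribe: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> str:
    """Step 1: audio -> transcript. Step 2: transcript + task -> LLM answer."""
    # Step 1: speech-to-text (hypothetical callable, e.g. a Whisper wrapper).
    transcript = transcribe(audio_path)

    # Step 2: hand the text to an LLM together with the task description.
    prompt = (
        "Transcript:\n"
        f"{transcript}\n\n"
        f"Task: {task}"
    )
    return ask_llm(prompt)
```

Keeping the two steps behind separate callables means the transcriber can be swapped (for example, for a diarizing one) without touching the LLM step.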
| ▲ | pants2 7 hours ago | |
Whisper simply isn't very good compared to LLM-based audio transcription such as gpt-4o-transcribe. If Gemini 3 is even better, it's a game-changer. | ||
| ▲ | crazysim 7 hours ago | |
Since Gemini seems to be sucking at timestamps, perhaps Whisper's timestamps could be supplied as an additional input alongside the audio to help ground them. | ||
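One way this grounding idea might look, as a rough sketch: turn timestamped transcript segments into a text block that can be sent next to the audio. It assumes each segment is a dict with `start` and `end` in seconds plus `text`, which is the shape of the segments Whisper returns. The helper names `format_timestamp` and `build_grounding_text` are hypothetical, and how the result is attached to a Gemini request is left out.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_grounding_text(segments: list[dict]) -> str:
    """Build one '[start - end] text' line per transcript segment."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return "\n".join(lines)
```

The resulting text would go into the prompt as a reference timeline, so the model can copy timestamps instead of estimating them from the audio.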