mistercheph 7 hours ago

For this use case I think the best bet is still a toolchain where a transcription model like Whisper feeds into an LLM that does the summarizing.

simonw 7 hours ago

Yeah I agree. I ran Whisper (via MacWhisper) on the same video and got back accurate timestamps.

The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.

The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
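A rough sketch of how that merge could work, assuming you already have two lists in hand: timestamped text segments from the transcription model and speaker turns from the speaker-identification pass. The `Segment` and `SpeakerTurn` types and the `label_segments` helper are hypothetical names made up for illustration; neither tool returns data in exactly this shape, so you would need to convert their output first. Each text segment gets the speaker whose turn overlaps it the most.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A timestamped chunk of text (hypothetical shape for transcription output)."""
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class SpeakerTurn:
    """A span of time attributed to one speaker (hypothetical shape)."""
    start: float  # seconds
    end: float    # seconds
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals, or 0.0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments: list[Segment], turns: list[SpeakerTurn]) -> list[tuple[float, str, str]]:
    """Attach to each segment the speaker whose turn overlaps it most.

    Returns (start_time, speaker, text) tuples. Segments that overlap
    no turn at all are labeled "unknown".
    """
    labeled = []
    for seg in segments:
        best_speaker = "unknown"
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = turn.speaker
        labeled.append((seg.start, best_speaker, seg.text))
    return labeled


if __name__ == "__main__":
    # Made-up example data, not real model output.
    segments = [
        Segment(0.0, 4.0, "Welcome to the hearing."),
        Segment(4.0, 9.0, "Thank you for having me."),
    ]
    turns = [
        SpeakerTurn(0.0, 4.5, "Chair"),
        SpeakerTurn(4.5, 10.0, "Witness"),
    ]
    for start, speaker, text in label_segments(segments, turns):
        print(f"[{start:6.1f}s] {speaker}: {text}")
```

Picking the largest overlap is the simplest rule that tolerates the two tools disagreeing slightly about where a turn starts. It will mislabel a segment where someone interrupts partway through, so splitting segments at turn boundaries would be the next refinement.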