Remix clone Hacker News

	▲	mistercheph 7 hours ago
		For this use case I think the best bet is still a toolchain: a transcription model like Whisper, with its output fed into an LLM to summarize.
	▲	simonw 7 hours ago
		Yeah I agree. I ran Whisper (via MacWhisper) on the same video and got back accurate timestamps. The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices. The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
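		One way the "mixture of both" idea could be wired together is to assign a speaker to each timestamped transcript segment by time overlap. This is only a rough sketch: the `Segment` and `SpeakerTurn` types and the `label_segments` helper are hypothetical names, and the sketch assumes both tools' output has already been parsed into start/end times in seconds. It does not call Whisper, Parakeet, or Gemini.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A timestamped transcript segment (assumed shape, e.g. parsed from a transcription tool)."""
    start: float
    end: float
    text: str


@dataclass
class SpeakerTurn:
    """A speaker-labeled time span (assumed shape, e.g. parsed from a speaker-identification pass)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Return the length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns, unknown="unknown"):
    """Attach to each segment the speaker whose turn overlaps it the most.

    Segments that overlap no turn at all get the `unknown` label.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((seg.start, best_speaker, seg.text))
    return labeled


if __name__ == "__main__":
    segments = [
        Segment(0.0, 4.0, "Hello"),
        Segment(4.0, 9.0, "Hi there"),
        Segment(20.0, 22.0, "Bye"),
    ]
    turns = [SpeakerTurn(0.0, 4.5, "A"), SpeakerTurn(4.5, 10.0, "B")]
    for start, speaker, text in label_segments(segments, turns):
        print(f"[{start:6.1f}s] {speaker}: {text}")
```

		Picking the turn with the largest overlap is a deliberately simple rule. It ignores cases such as people talking over each other, where one segment may really belong to two speakers.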