vlovich123 5 hours ago:
Multimodal models are only now starting to come into the space, and even then I don't know that they really support diarization yet (and often "multimodal" means thinking+speech/images; I'm not sure about audio).
jrk 4 hours ago (in reply):
I think they weren't asking "why can't Gemini 3, the model, just do good transcription?" They were asking "why can't Gemini, the API/app, recognize the task as something best solved not by a single generic model call, but by breaking it down into an initial subtask for a specialized ASR model followed by LLM cleanup, automatically, rather than me having to manually break down the task to achieve that result?"
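To make that concrete, here's a minimal sketch of the orchestration being described, under the assumption of a two-step pipeline. The two `call_*` functions are hypothetical stand-ins for whatever real ASR and LLM services you'd wire in (nothing here is a real Gemini API); the point is the decomposition itself, which the commenter wants the app layer to do automatically:

```python
# Sketch: route audio to a specialized ASR+diarization model first, then hand
# the raw transcript to an LLM for text-only cleanup. All names are illustrative.
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # diarization label, e.g. "SPEAKER_00"
    start: float  # seconds
    end: float
    text: str


def call_asr_with_diarization(audio_path: str) -> list[Segment]:
    # Hypothetical stand-in for a specialized ASR service that pairs
    # transcription with a diarization pass. Returns toy output here.
    return [
        Segment("SPEAKER_00", 0.0, 2.1, "so uh what were you saying about"),
        Segment("SPEAKER_01", 2.1, 4.0, "right the quarterly numbers yeah"),
    ]


def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a chat-completion call to any LLM.
    return "<cleaned transcript from the LLM goes here>"


def transcribe(audio_path: str) -> str:
    # Step 1: the specialized model handles the audio.
    segments = call_asr_with_diarization(audio_path)
    # Step 2: the LLM cleans up text it never had to "hear".
    raw = "\n".join(
        f"[{s.speaker} {s.start:.1f}-{s.end:.1f}s] {s.text}" for s in segments
    )
    return call_llm(
        "Fix punctuation and sentence breaks in this diarized transcript, "
        "keeping the speaker labels intact:\n\n" + raw
    )


print(transcribe("meeting.wav"))
```

Today you have to write this glue yourself; the complaint is that the app/API could plausibly recognize "transcribe this audio" and do the routing on its own.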