potatolicious 7 hours ago

You really want to break a task like this down into its constituent parts - especially because in this case the "end to end" way of doing it (i.e., raw audio straight to summary) doesn't actually get you anything.

IMO the right way to do this is to feed the audio into a transcription model, specifically one that supports diarization (separation of multiple speakers). This will give you a high-quality raw transcript that is pretty much exactly what was actually said.

It would be rough in places (e.g., "Speaker 1", "Speaker 2", etc. rather than actual speaker names)
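
A minimal sketch of that first step, using whisperx (exact API details vary by version, the pyannote diarization models need a Hugging Face token, and the filename here is illustrative):

    import whisperx

    device = "cpu"  # "cuda" if a GPU is available
    audio = whisperx.load_audio("council_meeting.wav")

    # Transcribe with Whisper, then align words to precise timestamps
    model = whisperx.load_model("large-v2", device, compute_type="int8")
    result = model.transcribe(audio)
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    # Diarize, then attach SPEAKER_00 / SPEAKER_01 labels to each segment
    diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    for seg in result["segments"]:
        print(f'[{seg["start"]:.1f}s] {seg.get("speaker", "UNKNOWN")}: {seg["text"]}')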

Then you want to post-process with an LLM to re-annotate the transcript and clean it up (e.g., replace "Speaker 1" with "Mayor Bob"), and to query against it.
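
That cleanup step is a single LLM call over the raw diarized transcript; a sketch assuming the OpenAI Python client (model name and prompt are illustrative):

    from openai import OpenAI

    client = OpenAI()
    raw_transcript = open("raw_transcript.txt").read()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are cleaning up a diarized meeting transcript. Infer real "
                "speaker names from context (e.g. the 'Speaker 1' who is "
                "addressed as 'Mayor' becomes 'Mayor Bob'), fix obvious "
                "transcription errors, and keep everything else verbatim. "
                "Do not summarize."
            )},
            {"role": "user", "content": raw_transcript},
        ],
    )
    print(response.choices[0].message.content)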

I see another post here complaining that direct-to-LLM beats a transcription model like Whisper - I would challenge that. Any modern ASR model will do a very, very good job with 95%+ accuracy.

simonw 6 hours ago | parent | next [-]

Which diarization models would you recommend, especially for running on macOS?

(Update: I just updated MacWhisper and it can now run Parakeet which appears to have decent diarization built in, screenshot here: https://static.simonwillison.net/static/2025/macwhisper-para... )

atonse 2 hours ago | parent [-]

I use parakeet daily (with MacWhisper) to transcribe my meetings. It works really well, even with the speaker segmentation.

darkwater 6 hours ago | parent | prev | next [-]

Why can't Gemini, the product, do that by itself? Isn't the point of all this AI hype to automate things with low effort?

vlovich123 5 hours ago | parent | next [-]

Multimodal models are only now starting to come into this space, and even then I don’t know that they really support diarization yet (and often “multimodal” means reasoning plus speech or images as input; I’m not sure about raw audio).

jrk 4 hours ago | parent [-]

I think they weren’t asking “why can’t Gemini 3, the model, just do good transcription,” they were asking “why can’t Gemini, the API/app, recognize the task as something best solved not by a single generic model call, but by breaking it down into an initial subtask for a specialized ASR model followed by LLM cleanup, automatically, rather than me having to manually break down the task to achieve that result.”

refulgentis 3 hours ago | parent | prev [-]

Speech recognition, as described above, is an AI too :) These LLMs are huge AIs that I guess could eventually replace all other AIs, but that’s the sort of speculation no one with knowledge of the field would endorse.

Separately, in my role as a wizened 16-year veteran of HN: it was jarring to read that. There’s a “rules” section, but don’t be turned off by the name; it’s more like a nice collection of guidelines for interacting in a way that encourages productive, illuminating discussion. One of the key rules is not to take the weakest interpretation of what someone wrote. Here, someone spelled out exactly how to do it, and we shouldn’t then assume it’s not AI, tie that to a vague, demeaning description of “AI hype”, and then ask an unanswerable question about what the point of “AI hype” is.

To be clear, it would be hard to be nontechnical and new to HN and know how to ask that a different way, I suppose.

sillyfluke 6 hours ago | parent | prev [-]

I'm curious when we started conflating transcription and summarization when discussing this LLM mess. Or maybe I'm confused about the output simonw is quoting as "the transcript", which starts off not with the actual transcript but with Meeting Outline and Summarization sections?

LLM summarization is utterly useless when you want 100% accuracy on the final binding decisions of things like council meetings. My experience has been that LLMs cannot be trusted to follow convoluted discussions, including ones that revisit earlier agenda items later in the meeting, etc.

With transcriptions, the catastrophic risk is far lower, since I'm doing the summarizing from the transcript myself. But in that case, for an auto-generated transcript, I'll take correct timestamps with gibberish-sounding sentences over incorrect timestamps with convincing-sounding but hallucinated sentences any day.

Any LLM summarization of a sufficiently important meeting requires second-by-second human verification against the audio recording. I have yet to see this convincingly refuted (i.e., an LLM that consistently maintains 100% accuracy when summarizing meeting decisions).

Royce-CMR 5 hours ago | parent | next [-]

This is a big area of focus for me and I agree: following a complex convo, especially when it gets picked up again 20-30 minutes later, is difficult.

But not impossible. I’ve had success with prompts that identify all the topics and then map all the conversation tied to each topic (each as separate LLM queries), then pull together a summary and conclusions by topic.
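
Very roughly, the shape of it is something like this (an illustrative sketch assuming an OpenAI-style client; not my actual prompts):

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    transcript = open("clean_transcript.txt").read()

    # Pass 1: enumerate every topic, merging agenda items revisited later
    topics = ask(
        "List every distinct topic discussed in this transcript, one per "
        "line, merging revisits of the same topic:\n\n" + transcript
    ).splitlines()

    # Pass 2: one query per topic, mapping all related conversation to it
    sections = [ask(
        f"Collect, in order, every part of this transcript that relates to "
        f"the topic '{topic}', then state the final conclusion on it:\n\n"
        + transcript
    ) for topic in topics]

    # Pass 3: pull the per-topic conclusions together into one summary
    print(ask("Combine these per-topic notes into meeting minutes:\n\n"
              + "\n\n".join(sections)))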

I’ve also had success with one-shot prompts, especially when the right context on the event and its phrasing is shared up front. But honestly I end up spending about 5-10 minutes reviewing and cleaning up the output before it’s solid.

But that’s worlds better than attending the event and then manually pulling together notes from your fast, in-flight shorthand.

(Former BA; ran JADs, etc.; lived and died by accuracy and the right color / expression / context in notes.)

simonw 6 hours ago | parent | prev [-]

That's why I shared these results. Understanding the difference between LLM summarization and exact transcription is really important for this kind of activity.