Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark (simonwillison.net)
117 points by nabla9 7 hours ago | 45 comments
simonw 6 hours ago | parent | next [-]

The audio transcript exercise here is particularly interesting from a journalism perspective.

Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.

I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):

  Output a Markdown transcript of this meeting. Include speaker
  names and timestamps. Start with an outline of the key
  meeting sections, each with a title and summary and timestamp
  and list of participating names. Note in bold if anyone
  raised their voices, interrupted each other or had
  disagreements. Then follow with the full transcript.
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...
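
A minimal sketch of that kind of call using the google-genai Python SDK (the model ID and local filename here are placeholders rather than the exact invocation):

  from google import genai

  client = genai.Client()  # reads GEMINI_API_KEY from the environment
  audio_file = client.files.upload(file="meeting.m4a")  # hypothetical filename
  prompt = (
      "Output a Markdown transcript of this meeting. Include speaker "
      "names and timestamps. ..."  # continue with the full prompt above
  )
  response = client.models.generate_content(
      model="gemini-3-pro-preview",  # assumed model ID; check the current docs
      contents=[prompt, audio_file],
  )
  print(response.text)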

I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.

It appears to have captured the gist of the meeting very well, but the fact that the transcript isn't close to an exact match to what was said - and the timestamps are incorrect - means it's very hard to trust the output. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they occurred... but what about if there was a key point that Gemini 3 omitted entirely?

luke-stanley an hour ago | parent | next [-]

Simon, sorry I didn't get around to answering your question on post-T5 encoder-decoders from the Markdown Lethal Trifecta prompt injection post. (https://news.ycombinator.com/item?id=45724941)

Since plain decoder-only models stole the show, Google DeepMind demonstrated a way to adapt LLMs, adding a T5 encoder to an existing normal Gemma model to get the benefits of the more grounded text-to-text tasks WITHOUT instruction tuning (and the increased risk of prompt injection). They also shared a few different variants on HuggingFace. I haven't got around to fine-tuning the weights of one for summarisation yet, but it could well be a good way to get more reliable summarisation. I did try out some models for inference though, and made a Gist here, which is useful since I found the HF default code example a bit broken:

https://gist.github.com/lukestanley/ee89758ea315b68fd66ba52c...

Google's minisite: https://deepmind.google/models/gemma/t5gemma/ Paper: https://arxiv.org/abs/2504.06225

Here is one such model that didn't hallucinate and actually did summarise on HF: https://huggingface.co/google/t5gemma-l-l-prefixlm

potatolicious 5 hours ago | parent | prev | next [-]

You really want to break a task like this down into its constituent parts - especially because in this case the "end to end" way of doing it (i.e., raw audio to summary) doesn't actually get you anything.

IMO the right way to do this is to feed the audio into a transcription model, specifically one that supports diarization (separation of multiple speakers). This will give you a high quality raw transcript that is pretty much exactly what was actually said.

It would be rough in places (e.g., "Speaker 1", "Speaker 2", etc. rather than actual speaker names).

Then you want to post-process with an LLM to re-annotate the transcript and clean it up (e.g., replace "Speaker 1" with "Mayor Bob"), and query against it.
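
A minimal sketch of that cleanup pass, assuming a diarized transcript file already exists and using the google-genai SDK purely as an example (the model ID, filename and prompt wording are illustrative, not a specific recommendation):

  from google import genai

  client = genai.Client()
  raw_transcript = open("diarized_transcript.txt").read()  # hypothetical file

  response = client.models.generate_content(
      model="gemini-2.5-flash",  # any capable text model would do
      contents=(
          "Below is a diarized meeting transcript with generic labels like "
          "'Speaker 1'. Rewrite it, replacing those labels with real names "
          "wherever the dialogue makes the speaker clear (e.g. 'Mayor Bob'), "
          "and keep every timestamp and line otherwise unchanged.\n\n"
          + raw_transcript
      ),
  )
  print(response.text)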

I see another post here complaining that direct-to-LLM beats a transcription model like Whisper - I would challenge that. Any modern ASR model will do a very, very good job with 95%+ accuracy.

simonw 5 hours ago | parent | next [-]

Which diarization models would you recommend, especially for running on macOS?

(Update: I just updated MacWhisper and it can now run Parakeet which appears to have decent diarization built in, screenshot here: https://static.simonwillison.net/static/2025/macwhisper-para... )

atonse an hour ago | parent [-]

I use parakeet daily (with MacWhisper) to transcribe my meetings. It works really well, even with the speaker segmentation.

darkwater 5 hours ago | parent | prev | next [-]

Why can't Gemini, the product, do that by itself? Isn't the point of all this AI hype to easily automate things with low effort?

vlovich123 3 hours ago | parent | next [-]

Multimodal models are only now starting to come into the space, and even then I don’t know that they really support diarization yet (and often multimodal is thinking+speech/images, not sure about audio).

jrk 3 hours ago | parent [-]

I think they weren’t asking “why can’t Gemini 3, the model, just do good transcription,” they were asking “why can’t Gemini, the API/app, recognize the task as something best solved not by a single generic model call, but by breaking it down into an initial subtask for a specialized ASR model followed by LLM cleanup, automatically, rather than me having to manually break down the task to achieve that result.”

refulgentis 2 hours ago | parent | prev [-]

Speech recognition, as described above, is an AI too :) These LLMs are huge AIs that I guess could eventually replace all other AIs, but that’s sort of speculation no one with knowledge of the field would endorse.

Separately, in my role as wizened 16-year-old veteran of HN: it was jarring to read that. There’s a “rules” section, but don’t be turned off by the name; it’s more like a nice collection of guidelines for interacting in a way that encourages productive, illuminating discussion. One of the key rules is not to interpret things weakly. Here, someone spelled out exactly how to do it, and we shouldn’t then assume it’s not AI, then tie it to a vague, demeaning description of “AI hype”, then ask an unanswerable question about what the point of “AI hype” is.

To be clear, if you’re nontechnical and new to HN, it would be hard to know how to ask that a different way, I suppose.

sillyfluke 5 hours ago | parent | prev [-]

I'm curious when we started conflating transcription and summarization when discussing this LLM mess. Or maybe I'm confused about the output simonw is quoting as "the transcript", which starts off not with the actual transcript but with Meeting Outline and Summary sections?

LLM summarization is utterly useless when you want 100% accuracy on the final binding decisions of things like council meetings. My experience has been that LLMs cannot be trusted to follow convoluted discussions, including revisiting earlier agenda items later in the meeting, etc.

With transcriptions, the catastrophic risk is far lower, since I'm doing the summarizing from the transcript myself. But in that case, for an auto-generated transcript, I'll take correct timestamps with gibberish-sounding sentences over incorrect timestamps with "convincing"-sounding but hallucinated sentences any day.

Any LLM summarization of a sufficiently important meeting requires second-by-second human verification against the audio recording. I have yet to see this convincingly refuted (i.e., an LLM that consistently maintains 100% accuracy when summarizing meeting decisions).

Royce-CMR 4 hours ago | parent | next [-]

This is a major area of focus for me and I agree: following a complex conversation, especially when it gets picked up again 20-30 minutes later, is difficult.

But not impossible. I’ve had success with prompts that identify all topics and then map all conversation tied to each topic (each as separate LLM queries), and then pull together a summary and conclusions by topic.

I’ve also had success with one-shot prompts - especially with the right context on the event and phrasing shared. But honestly, I end up spending about 5-10 minutes reviewing and cleaning up the output before it’s solid.

But that’s worlds better than attending the event and then manually pulling together notes from your fast, in-flight shorthand.

(Former BA, ran JADs etc, lived and died by accuracy and right color / expression / context in notes)

simonw 5 hours ago | parent | prev [-]

That's why I shared these results. Understanding the difference between LLM summarization and exact transcriptions is really important for this kind of activity.

luke-stanley an hour ago | parent | prev | next [-]

I'd lower the temperature and try a DSPy Refine loop on it (or similar). Audio encoder-decoder models and segmentation are good things to try too. A length mismatch would be a bad sign. DSPy has optimisers; it could probably optimise well with a length-match heuristic, and there is probably a good Shannon-entropy rule too.

byt3bl33d3r 6 hours ago | parent | prev | next [-]

I’ve been meaning to create & publish a structured extraction benchmark for a while. Using LLMs to extract info/entities/connections from large amounts of unstructured data is also a huge boon to AI-assisted reporting, and it also has a number of cybersecurity applications. Gemini 2.5 was pretty good, but so far I have yet to see an LLM that can do this reliably, accurately and consistently.

simonw 6 hours ago | parent [-]

This would be extremely useful. I think this is one of the most commercially valuable uses of these kinds of models; having more solid independent benchmarks would be great.

rahimnathwani 6 hours ago | parent | prev | next [-]

For this use case, why not use Whisper to transcribe the audio, and then an LLM to do a second step (summarization or answering questions or whatever)?

If you need diarization, you can use something like https://github.com/m-bain/whisperX
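
A rough sketch of the whisperX flow, loosely following its README (exact function names may have shifted between versions, so treat this as a starting point rather than gospel):

  import whisperx

  device = "cuda"             # or "cpu"
  audio_file = "meeting.m4a"  # hypothetical filename

  # 1. Transcribe with Whisper (batched)
  model = whisperx.load_model("large-v2", device, compute_type="float16")
  audio = whisperx.load_audio(audio_file)
  result = model.transcribe(audio, batch_size=16)

  # 2. Align for accurate word-level timestamps
  model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
  result = whisperx.align(result["segments"], model_a, metadata, audio, device)

  # 3. Diarize and assign speaker labels
  diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
  diarize_segments = diarize_model(audio)
  result = whisperx.assign_word_speakers(diarize_segments, result)

  print(result["segments"])  # segments now carry text, timestamps and speakers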

pants2 6 hours ago | parent | next [-]

Whisper simply isn't very good compared to LLM audio transcription like gpt-4o-transcribe. If Gemini 3 is even better it's a game-changer.

crazysim 6 hours ago | parent | prev [-]

Since Gemini seems to be sucking at timestamps, perhaps Whisper can be used to help ground that as an additional input alongside the audio.

WesleyLivesay 6 hours ago | parent | prev | next [-]

It appears to have done a good job of summarizing the points that it did summarize, at least judging from my quick watch of a few sections and from the YT transcript (which seems quite accurate).

Almost makes me wonder if, behind the scenes, it is doing something like: rough transcript -> summaries -> transcript with timecodes (runs out of context) -> slaps whatever timestamps it has onto the summaries.

I would be very curious to see if it does better on something like an hour-long chunk of audio, to see if it is just some sort of context issue. Or if this same audio were fed to it in, say, 45-minute chunks, whether the timestamps fix themselves.

stavros 3 hours ago | parent | prev | next [-]

Just checking, but did you verify that the converted ffmpeg audio wasn't around an hour long? Maybe ffmpeg sped it up when converting?

simonw 2 hours ago | parent [-]

Here is the file I used; it is the expected length: https://static.simonwillison.net/static/2025/HMB-nov-4-2025....

Workaccount2 6 hours ago | parent | prev | next [-]

My assumption is that Gemini has no insight into the timestamps, and is instead ballparking them based on how much context has been analyzed up to that point.

I wonder whether, if you put the audio into a video that is nothing but a black screen with a running timer, it would be able to timestamp correctly.

simonw 6 hours ago | parent | next [-]

The Gemini documentation specifically mentions timestamp awareness here: https://ai.google.dev/gemini-api/docs/audio

minimaxir 6 hours ago | parent | prev [-]

Per the docs, Gemini represents each second of audio as 32 tokens. Since it's a consistent amount, as long as the model is trained to understand the relation between timestamps and the number of tokens (which per Simon's link it is), it should be able to infer the correct number of seconds.
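
(For scale: this 3.5-hour meeting is roughly 12,600 seconds of audio, so at 32 tokens per second that works out to about 400,000 input tokens before the model generates a single character of transcript.)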

ks2048 6 hours ago | parent | prev | next [-]

Does anyone benchmark these models for speech-to-text using traditional word error rates? It seems audio-input Gemini is a lot cheaper than Google Speech-to-Text.

simonw 6 hours ago | parent [-]

Here's one: https://voicewriter.io/speech-recognition-leaderboard

"Real-World Speech-to-text API Leaderboard" - it includes scores for Gemini 2.5 Pro and Flash.

mistercheph 6 hours ago | parent | prev [-]

For this use case I think the best bet is still a toolchain with a transcription model like Whisper fed into an LLM to summarize.

simonw 6 hours ago | parent [-]

Yeah I agree. I ran Whisper (via MacWhisper) on the same video and got back accurate timestamps.

The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.

The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.

Redster 20 minutes ago | parent | prev | next [-]

So Gemini 3 Pro dropped today, which happens to be the day I proofread a historical timeline I'm assisting a PhD with. I do one pass and then realize I should try Gemini 3 Pro on it. I give the exact same prompt to 3 Pro and to Claude 4.5 Sonnet. 3 Pro finds 25 real errors, no hallucinations. Claude finds 7 errors, but only 2 of those are unique to Claude. (Claude was better at "wait, that reference doesn't match the content! It should be $corrected_citation!".) But Gemini's visual understanding was top-notch. Its biggest flaw was that it saw words that wrapped as having extra spaces. But it also correctly caught a typo where a wrapped word was misspelled, so something about it seemed to fixate on those line breaks, I think.

Redster 19 minutes ago | parent [-]

A better test would be 2.5 Pro vs 3 Pro. Google has just been doing better at vision for a while.

scosman 2 hours ago | parent | prev | next [-]

If anyone enjoys cheeky mischief targeting LLM benchmark hackers: https://github.com/scosman/pelicans_riding_bicycles

I'll need to update for V2!

yberreby 2 hours ago | parent [-]

Delightfully evil.

Wowfunhappy 5 hours ago | parent | prev | next [-]

Aww, I don’t like the new pelican benchmark as much. I liked that the old prompt was vague and we could see how the AI interpreted it.

ahmedfromtunis 4 hours ago | parent [-]

Yeah. The new challenge seems easier to solve since it basically hand-holds the LLMs toward what the result should look like.

I think a more challenging, well, challenge, would be to offer an even more absurd scenario and see how the model handles it.

Example: generate an svg of a pelican and a mongoose eating popcorn inside a pyramid-shaped vehicle flying around Jupiter. Result: https://imgur.com/a/TBGYChc

simonw 4 hours ago | parent [-]

I like the hand-holding because it's a better test of how well models can follow more detailed instructions.

I was inspired by Max Woolf's nano banana test prompts: https://minimaxir.com/2025/11/nano-banana-prompts/

ahmedfromtunis 4 hours ago | parent [-]

That's a valid point, but I'd argue the new test would then be interesting to couple with the original one rather than replace it.

Do you think it would be reasonable to include both in future reviews, at least for the sake of back-compatibility (and comparability)?

simonw 4 hours ago | parent [-]

Yeah I'm going to keep on using the old one as well.

londons_explore 6 hours ago | parent | prev | next [-]

Anyone got a class full of students and able to get a human version of this pelican benchmark?

Perhaps half with a web browser to view the results, and half working blind with the numbers alone?

roxolotl 3 hours ago | parent [-]

If a human was capable of doing it blind I’d be absolutely blown away. Many can’t even draw a bicycle from memory[0].

0: https://www.wired.com/2016/04/can-draw-bikes-memory-definite...

leetharris 5 hours ago | parent | prev | next [-]

I used to work in ASR. Due to the nature of current multimodal architectures, it is unlikely we'll ever see accurate timestamps over a longer horizon. You're better off using an encoder-decoder ASR architecture, then traditional diarization via embedding clustering, then a multimodal model to refine it, then a forced-alignment technique (maybe even something pre-NN) to get proper timestamps, reconciling everything at the end.

These things are getting really good at just regular transcription (as long as you don't care about verbatimicity), but every additional dimension you add (timestamps, speaker assignment, etc.) will make the others worse. These work much better as independent processes that then get reconciled and refined by a multimodal LLM.
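
The reconciliation step can be as simple as an overlap match; a toy sketch in Python (the data shapes here are assumptions, not any particular library's output):

  # Give each ASR segment the speaker whose diarization turn overlaps it most.
  def assign_speakers(asr_segments, diarization_turns):
      # asr_segments: [{"start": float, "end": float, "text": str}, ...]
      # diarization_turns: [{"start": float, "end": float, "speaker": str}, ...]
      labelled = []
      for seg in asr_segments:
          best_speaker, best_overlap = "unknown", 0.0
          for turn in diarization_turns:
              overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
              if overlap > best_overlap:
                  best_speaker, best_overlap = turn["speaker"], overlap
          labelled.append({**seg, "speaker": best_speaker})
      return labelled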

vessenes an hour ago | parent | prev | next [-]

I'd like a long bet on when Simon will add animation to the pelican SVG benchmark. I'm thinking late 2026.

ZeroConcerns 6 hours ago | parent | prev | next [-]

> so I shrunk the file down to a more manageable 38MB using ffmpeg

Without having an LLM figure out the required command line parameters? Mad props!

simonw 5 hours ago | parent [-]

Hah, nope! I had Claude Code figure that one out.

nurumaik 5 hours ago | parent | prev | next [-]

Seems like the pelican benchmark has finally been added to the model training process.

razodactyl 3 hours ago | parent | prev [-]

I was waiting for this post.

Love the pivot in pelican generation bench.