rfw300 | 3 hours ago
My audio experiment was much less successful: I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3:

- Hallucinated at least three quotes (that I checked) that resembled nothing said by any of the hosts.
- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode rather than 85 minutes.
- Heavily paraphrased and abridged almost everything it transcribed, in most cases without any indication.

It's understandable that Gemini can't cope with such a long audio recording yet, but I would have hoped for a more graceful, less hallucinatory failure mode. Unfortunately, this aligns with my impression of past Gemini models: they are impressively smart but fail in the most catastrophic ways.
Rudybega | an hour ago
I wonder if you could get around this with a slightly more sophisticated harness. I suspect you're running into context length issues. Something like (sketched below):

1.) Split the audio into multiple smaller tracks.
2.) Perform a first-pass audio extraction on each track.
3.) Pull out the unique speakers and other potentially helpful information (maybe just a short summary of where the conversation left off).
4.) Seed the next stage with that information (yay multimodality) and generate the audio transcript for it.

Obviously it would be ideal if a model could handle ultra-long-context conversations by default, but I'd be curious how much of the error is caused by a lack of general capability vs. simple context pollution.
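Roughly, a minimal sketch of that kind of chunked harness, assuming Python with pydub for the splitting and the google-generativeai client for the model calls. The model name, prompt wording, chunk length, and the carry-over summary step are all placeholder assumptions, not a tested recipe:

```python
# Hypothetical chunked-transcription harness (sketch only).
# Assumes: pydub + ffmpeg for splitting, google-generativeai for model calls.
import google.generativeai as genai
from pydub import AudioSegment

genai.configure(api_key="YOUR_API_KEY")          # assumption: key provided out of band
model = genai.GenerativeModel("gemini-1.5-pro")  # assumption: any audio-capable Gemini model

CHUNK_MINUTES = 10  # assumption: small enough to stay well under the context limit

def split_audio(path: str) -> list[str]:
    """1.) Split the episode into smaller tracks."""
    audio = AudioSegment.from_file(path)
    chunk_ms = CHUNK_MINUTES * 60 * 1000
    paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        out = f"chunk_{i:03d}.mp3"
        audio[start:start + chunk_ms].export(out, format="mp3")
        paths.append(out)
    return paths

def transcribe_chunk(path: str, offset_min: int, carry_over: str) -> str:
    """2.) First-pass transcription of one chunk, 4.) seeded with context from earlier chunks."""
    audio_file = genai.upload_file(path)
    prompt = (
        "Transcribe this audio verbatim with speaker labels and timestamps. "
        f"This chunk starts {offset_min} minutes into the full episode; offset timestamps accordingly. "
        f"Context from earlier chunks (speakers seen so far, where the conversation left off): "
        f"{carry_over or 'none yet'}"
    )
    return model.generate_content([prompt, audio_file]).text

def summarize_for_next_chunk(transcript: str) -> str:
    """3.) Extract unique speakers and a short 'where we left off' summary to seed the next stage."""
    prompt = (
        "From this transcript chunk, list the distinct speakers and give a two-sentence "
        "summary of where the conversation left off:\n\n" + transcript
    )
    return model.generate_content(prompt).text

def transcribe_episode(path: str) -> str:
    carry_over, pieces = "", []
    for i, chunk_path in enumerate(split_audio(path)):
        transcript = transcribe_chunk(chunk_path, offset_min=i * CHUNK_MINUTES, carry_over=carry_over)
        carry_over = summarize_for_next_chunk(transcript)
        pieces.append(transcript)
    return "\n".join(pieces)

if __name__ == "__main__":
    print(transcribe_episode("episode.mp3"))
```

Passing each chunk's start offset in the prompt is meant to address the timestamp drift described upthread, and the carry-over summary is the "seed the next stage" step, so speaker labels stay consistent across chunks. No idea how well it holds up in practice without trying it.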
ant6n | an hour ago
The worst is when it fails to read simple PDF documents and then lies and gaslights in an attempt to cover it up. Why not just admit you can't read the file?