Workaccount2 7 hours ago

My assumption is that Gemini has no insight into the timestamps, and is instead ballparking them based on how much context has been analyzed up to that point.

I wonder whether, if you put the audio into a video that is nothing but a black screen with a running timer, it would be able to timestamp correctly.

simonw 7 hours ago

The Gemini documentation specifically mentions timestamp awareness here: https://ai.google.dev/gemini-api/docs/audio
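
That page describes prompting with MM:SS timestamps against uploaded audio. A minimal sketch of a timestamped-transcript request, assuming the google-generativeai Python SDK and a made-up local file name:

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # Upload the audio file (file name is hypothetical).
    audio = genai.upload_file("interview.mp3")

    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(
        [audio, "Transcribe this audio, prefixing each speaker turn with an MM:SS timestamp."]
    )
    print(response.text)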

minimaxir 7 hours ago

Per the docs, Gemini represents each second of audio as 32 tokens. Since that rate is constant, as long as the model has been trained to understand the relationship between timestamps and token counts (which, per Simon's link, it has), it should be able to infer the correct number of seconds.
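
A back-of-the-envelope sketch of that arithmetic (the 32-tokens-per-second rate is from the docs; the helper itself is purely illustrative):

    # Per the Gemini docs, each second of audio is represented as 32 tokens.
    TOKENS_PER_SECOND = 32

    def token_offset_to_timestamp(token_offset: int) -> str:
        """Map an audio token offset back to an MM:SS timestamp."""
        seconds = token_offset // TOKENS_PER_SECOND
        return f"{seconds // 60:02d}:{seconds % 60:02d}"

    # The 4,800th audio token corresponds to 4800 / 32 = 150 s, i.e. "02:30".
    print(token_offset_to_timestamp(4800))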