minimaxir 7 hours ago

Per the docs, Gemini represents each second of audio as 32 tokens. Since that rate is fixed, as long as the model is trained to understand the relationship between timestamps and token counts (which, per Simon's link, it is), it should be able to infer the correct number of seconds.
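
To make the arithmetic concrete, here's a rough sketch of the mapping at the 32 tokens/second rate. The function names are made up for illustration, and this only shows the math; it says nothing about how the model actually does it internally.

```python
# Rate quoted from the docs in the comment above.
TOKENS_PER_SECOND = 32


def audio_token_count(duration_seconds: int) -> int:
    """Number of audio tokens for a clip of the given length (hypothetical helper)."""
    return duration_seconds * TOKENS_PER_SECOND


def token_index_to_timestamp(token_index: int) -> str:
    """Map a 0-based audio token position to an MM:SS timestamp (hypothetical helper)."""
    seconds = token_index // TOKENS_PER_SECOND
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


print(audio_token_count(60))           # one minute of audio
print(token_index_to_timestamp(2400))  # token 2400 falls in second 75
```

So one minute of audio is 1,920 tokens, and token 2400 lands at 01:15. Because the rate is constant, position alone is enough to recover the timestamp.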