timmg 5 hours ago

(If you know) how does that work?

Are audio, video, and images tokenized the same way as text and then fed in as a stream? Or is the training objective different from "predict next token"?

If the former, do you think there are limitations to the "stream of tokens" approach? Or is that essentially how humans work? (I think of our input as many-dimensional, but maybe part of our perception layer compresses it into a stream of tokens.)

johnb231 5 hours ago

Ask Gemini to explain how it was trained:

https://g.co/gemini/share/f64c3358d9fa