timmg 5 hours ago
(If you know) how does that work? Are the audio/video/images tokenized the same way as text and then fed in as a stream? Or is the training objective different from "predict next token"? If the former, do you think there are limitations to a "stream of tokens"? Or is that essentially how humans work? (I think of our input as many-dimensional, but maybe it is compressed to a stream of tokens somewhere in our perception layer.)
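For what it's worth, one common published approach (ViT-style vision encoders; whether Gemini does exactly this isn't public) is to cut an image into fixed-size patches, flatten each patch, and linearly project it into the same embedding space as text tokens, so the transformer just sees one interleaved stream of embeddings. A minimal sketch, with all names and shapes illustrative rather than any real model's implementation:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened (N, patch*patch*C) patches."""
    H, W, C = image.shape
    image = image[: H - H % patch, : W - W % patch]  # crop to a multiple of patch
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
d_model = 64                                       # shared embedding width (illustrative)
W_proj = rng.normal(size=(16 * 16 * 3, d_model))   # learned in a real model; random here

img = rng.random((224, 224, 3))
patches = patchify(img)                            # (196, 768): 14x14 patches of 16*16*3 values
img_tokens = patches @ W_proj                      # (196, 64): image "tokens" as embeddings

text_tokens = rng.normal(size=(10, d_model))       # stand-in embeddings for 10 text tokens
stream = np.concatenate([text_tokens, img_tokens]) # one token stream, shape (206, 64)
print(stream.shape)
```

Audio is often handled analogously (e.g. discrete codec tokens or spectrogram patches), and then the usual next-token objective can be applied over the combined stream.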
johnb231 5 hours ago | parent
Ask Gemini to explain how it was trained |