timmg 5 hours ago
(If you know) how does that work? Are the audio/video/images tokenized the same way as text and then fed in as a stream? Or is the training objective different from "predict next token"? If the former, do you think there are limitations to a "stream of tokens"? Or is that essentially how humans work? (I think of our input as many-dimensional, but maybe it is compressed to a stream of tokens somewhere in our perception layer.)
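For what it's worth, one common published approach (ViT-style vision encoders; whether Gemini does exactly this isn't public) is to cut an image into fixed-size patches, flatten each patch, and linearly project it into the same embedding space as text tokens, so the transformer just sees one interleaved stream of embeddings. A minimal sketch, with all names and shapes illustrative rather than any real model's implementation:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened (N, patch*patch*C) patches."""
    H, W, C = image.shape
    image = image[: H - H % patch, : W - W % patch]  # crop to a multiple of patch
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
d_model = 64                                       # shared embedding width (illustrative)
W_proj = rng.normal(size=(16 * 16 * 3, d_model))   # learned in a real model; random here

img = rng.random((224, 224, 3))
patches = patchify(img)                            # (196, 768): 14x14 patches of 16*16*3 values
img_tokens = patches @ W_proj                      # (196, 64): image "tokens" as embeddings

text_tokens = rng.normal(size=(10, d_model))       # stand-in embeddings for 10 text tokens
stream = np.concatenate([text_tokens, img_tokens]) # one token stream, shape (206, 64)
print(stream.shape)
```

Audio is often handled analogously (e.g. discrete codec tokens or spectrogram patches), and then the usual next-token objective can be applied over the combined stream.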
johnb231 5 hours ago | parent
Ask Gemini to explain how it was trained |