johnb231 9 hours ago

> More people need to get away from training their models just with words.

They started doing that a couple of years ago. The frontier "language" models are natively multimodal, trained on audio, text, video, and images. That is all in the same model, not separate models stitched together. The inputs are tokenized and mapped into a shared embedding space.

Gemini, GPT-4o, Grok 3, Claude 3, and Llama 4 are all multimodal, not just "language models".
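
Roughly, each modality gets its own tokenizer or encoder, but everything ends up as vectors in one embedding space that a single transformer attends over. A minimal sketch in PyTorch (module names, dimensions, and the simple linear projections are illustrative, not any specific model's architecture):

    import torch
    import torch.nn as nn

    D_MODEL = 4096  # shared embedding width (illustrative)

    class MultimodalEmbedder(nn.Module):
        def __init__(self, vocab_size=128_000, image_feat_dim=1024, audio_feat_dim=80):
            super().__init__()
            # Text: discrete token ids -> embedding lookup
            self.text_embed = nn.Embedding(vocab_size, D_MODEL)
            # Images: patch features (e.g. from a vision encoder) -> projection into the same space
            self.image_proj = nn.Linear(image_feat_dim, D_MODEL)
            # Audio: frame features (e.g. spectrogram frames) -> same space
            self.audio_proj = nn.Linear(audio_feat_dim, D_MODEL)

        def forward(self, text_ids, image_patches, audio_frames):
            # Each modality becomes a sequence of D_MODEL-dim vectors...
            t = self.text_embed(text_ids)        # (B, T_text, D_MODEL)
            i = self.image_proj(image_patches)   # (B, T_img,  D_MODEL)
            a = self.audio_proj(audio_frames)    # (B, T_aud,  D_MODEL)
            # ...and they are concatenated into one interleaved sequence
            # that a single transformer stack processes end to end.
            return torch.cat([t, i, a], dim=1)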

timmg 6 hours ago | parent [-]

(If you know) how does that work?

Are the audio/video/images tokenized the same way as text and then fed in as a stream? Or is the training objective different from "predict next token"?

If the former, do you think there are limitations to "stream of tokens"? Or is that essentially how humans work? (I think of our input as many-dimensional, but maybe it is compressed to a stream of tokens in some part of our perception layer.)

johnb231 6 hours ago | parent [-]

Ask Gemini to explain how it was trained:

https://g.co/gemini/share/f64c3358d9fa
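
In broad strokes, though (a generic sketch of an autoregressive objective, not Gemini's actual training recipe): once everything is one interleaved token sequence, the loss is still next-token cross-entropy over the positions that have a discrete target, e.g.:

    import torch
    import torch.nn.functional as F

    def next_token_loss(logits, target_ids):
        # logits: (B, T, vocab) from the transformer over the interleaved sequence
        # target_ids: (B, T) token ids, shifted one position relative to the input;
        # positions without a discrete target are set to -100 and ignored
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_ids.reshape(-1),
            ignore_index=-100,
        )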