Remix clone Hacker News

What Carmack is doing is right. More people need to get away from training their models just with words. AI need the physicality.

johnb231 3 hours ago | parent | next [-]

> More people need to get away from training their models just with words.

They started doing that a couple of years ago. The frontier "language" models are natively multimodal, trained on audio, text, video, images. That is all in the same model, not separate models stitched together. The inputs are tokenized and mapped into a shared embedding space.

Gemini, GPT-4o, Grok 3, Claude 3, Llama 4. These are all multimodal, not just "language models".
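To make "tokenized and mapped into a shared embedding space" concrete, here is a toy sketch of one common approach (a lookup table for text tokens, a ViT-style linear projection for image patches). It is not how any particular model above is implemented. All names and sizes (`D_MODEL`, `embed_text`, `embed_patches`, `build_sequence`) are made up for illustration, and the weights are random rather than learned.

```python
import random

D_MODEL = 8        # shared embedding width (illustrative assumption)
VOCAB_SIZE = 16    # toy text vocabulary size (assumption)
PATCH_PIXELS = 4   # a flattened 2x2 grayscale patch (assumption)

rng = random.Random(0)


def random_matrix(rows, cols):
    """Stand-in for learned weights: a rows x cols matrix of random floats."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Text: one row per token id, looked up directly.
text_embedding = random_matrix(VOCAB_SIZE, D_MODEL)
# Images: a linear projection from patch pixels to the same width.
patch_projection = random_matrix(PATCH_PIXELS, D_MODEL)


def embed_text(token_ids):
    """Map each text token id to its D_MODEL-wide vector."""
    return [text_embedding[t] for t in token_ids]


def embed_patches(patches):
    """Project each flattened patch (PATCH_PIXELS floats) to D_MODEL floats."""
    return [
        [
            sum(patch[i] * patch_projection[i][j] for i in range(PATCH_PIXELS))
            for j in range(D_MODEL)
        ]
        for patch in patches
    ]


def build_sequence(token_ids, patches):
    """Concatenate both modalities into one sequence of same-width vectors."""
    return embed_patches(patches) + embed_text(token_ids)


sequence = build_sequence(
    token_ids=[3, 7, 1],
    patches=[[0.0, 0.5, 1.0, 0.25], [1.0, 1.0, 0.0, 0.0]],
)
print(len(sequence), len(sequence[0]))  # 5 8
```

Once everything is a vector of the same width, the transformer sees a single sequence and does not need a separate code path per modality. Real systems use far larger dimensions and typically more elaborate encoders for audio and video.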

timmg an hour ago | parent [-]

(If you know) how does that work?

Are the audio/video/images tokenized the same way as text and then fed in as a stream? Or is the training objective different than "predict next token"?

If the former, do you think there are limitations to "stream of tokens"? Or is that essentially how humans work? (Like I think of our input as many-dimensional. But maybe it is compressed to a stream of tokens in part of our perception layer.)

johnb231 an hour ago | parent [-]

Ask Gemini to explain how it was trained: https://g.co/gemini/share/f64c3358d9fa

NL807 9 hours ago | parent | prev | next [-]

>AI need the physicality.

Which I found interesting, because I remember Carmack saying simulated environments are the way forward and physical environments are too impractical for developing AI.

SeanaldMcDnld 9 hours ago | parent [-]

Yeah, in that way this demo seemed gimmicky, as he acknowledged. He said in the past that he would almost count people out if they weren’t training RL in a virtual environment. I agree, though I’m still happy he’s staying on the path of online continual learning.

programd 8 hours ago | parent | prev [-]

Nvidia seems to think the same thing. Here's Jim Fan talking about a "physical Turing test" and how embodied AI is the way forward.

https://www.youtube.com/watch?v=_2NijXqBESI

He also talks about needing large amounts of compute to run the virtual environments where you'll be training embodied AI. Very much worth watching.