| ▲ | avaer 7 hours ago | |
Soft disagree; if you wanted imagination you don't need to make a video model. You probably don't need to decode the latents at all. That seems pretty far from information-theoretic optimality, the kind that you want in a good+fast AI model making decisions. The whole reason LLMs output human-processable text, and "world models" output human-interactive video, is precisely so that humans can connect in and debug the thing. I think the purpose of Genie is to be a video game, but it's a video game for AI researchers developing AIs. I do agree that the entertainment implications are kind of the research exhaust of the end goal. | ||
| ▲ | in-silico 7 hours ago | parent | next [-] | |
Sufficiently informative latents can be decoded into video. When you simulate a stream of those latents, you can decode them into video. If you were trying to make an impressive demo for the public, you probably would decode them into video, even if the real applications don't require it. Converting the latents to pixel space also makes them compatible with existing image/video models and multimodal LLMs, which (without specialized training) can't interpret the latents directly. | ||
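To make that concrete, here's a minimal toy sketch (nothing below is Genie's or anyone's actual architecture; module names and shapes are made up): the rollout itself lives entirely in latent space, and the pixel decoder only gets invoked when a human, or a pixel-space model, needs to look.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Toy stand-in: predicts the next latent from the current latent and action."""
    def __init__(self, latent_dim=256, action_dim=8):
        super().__init__()
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)

    def step(self, z, a):
        # treat the latent itself as the recurrent state
        return self.dynamics(torch.cat([z, a], dim=-1), z)

class PixelDecoder(nn.Module):
    """Only needed when a human (or a pixel-space model) has to look at the rollout."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Sigmoid())

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)   # (B, C, H, W) frame

world_model, decoder = LatentWorldModel(), PixelDecoder()
z = torch.zeros(1, 256)                 # initial latent state
frames = []
for t in range(16):
    a = torch.randn(1, 8)               # placeholder action
    z = world_model.step(z, a)          # the rollout itself never touches pixels
    frames.append(decoder(z))           # decoding is optional, per step
video = torch.stack(frames, dim=1)      # (B, T, C, H, W), only if you want to watch it
```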
| ▲ | NitpickLawyer 6 hours ago | parent | prev | next [-] | |
> I think the purpose of Genie is to be a video game, but it's a video game for AI researchers developing AIs. Yeah, I think this is what the person above was saying as well. This is what people at Google have said already (a few podcasts on GDM's channel, hosted by Hannah Fry). They have their "agents" play in Genie-powered environments. So one system "creates" the environment for the task. Say "place the ball in the basket". Genie creates an env with a ball and a basket, and the other agent learns to wasd its way around, pick up the ball and wasd to the basket, and so on. Pretty powerful combo if you have enough compute to throw at it. | ||
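A back-of-the-envelope sketch of that combo, with entirely made-up interfaces (Genie's actual API isn't public): one component turns the task description into a playable environment, and a separate agent learns in it with ordinary RL.

```python
import random

class GeneratedEnv:
    """Stand-in for a Genie-style environment conjured from a task description."""
    def __init__(self, task: str):
        self.task = task
        self.ball_held = False
        self.steps = 0

    def reset(self):
        self.ball_held, self.steps = False, 0
        return {"frame": 0, "task": self.task}

    def step(self, action: str):
        self.steps += 1
        if action == "pick_up":
            self.ball_held = True
        done = self.ball_held and action == "drop"   # ball placed in the basket
        reward = 1.0 if done else 0.0
        return {"frame": self.steps}, reward, done

ACTIONS = ["w", "a", "s", "d", "pick_up", "drop"]

env = GeneratedEnv("place the ball in the basket")
obs = env.reset()
for _ in range(100):            # random policy standing in for the learning agent
    obs, reward, done = env.step(random.choice(ACTIONS))
    if done:
        break
```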
| ▲ | magospietato an hour ago | parent | prev | next [-] | |
I wonder what training insights could be gained by having proven general intelligences actively navigate a generative world model? | ||
| ▲ | sailingparrot 4 hours ago | parent | prev | next [-] | |
> you don't need to make a video model. You probably don't need to decode the latents at all. If you don't decode, how do you judge quality, in a world where generative metrics are famously hard and imprecise? And how do you integrate RLHF/RLAIF into your pipeline (which is not something you can skip anymore if you want SotA) if you don't decode? Just look at the companies that are explicitly aiming for robotics/simulation: they *are* doing video models. | ||
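To illustrate the point (toy modules, not anyone's real pipeline): preference/reward models, like human raters, operate on decoded frames, so the decoder is where the feedback signal attaches.

```python
import torch
import torch.nn as nn

class FrameRewardModel(nn.Module):
    """Scores pixel-space frames, e.g. trained from human or AI preferences."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )

    def forward(self, frames):                 # (B, 3, H, W)
        return self.net(frames).squeeze(-1)    # one scalar score per frame

reward_model = FrameRewardModel()
decoded_frames = torch.rand(8, 3, 64, 64)      # output of the video decoder
rewards = reward_model(decoded_frames)         # usable as an RLHF/RLAIF signal
# Without decoding there is no shared pixel space for this reward model
# (or for human raters) to operate in.
```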
| ▲ | SequoiaHope 7 hours ago | parent | prev | next [-] | |
Didn’t the original world models paper do some training in latent space? (Edit: yes[1]) I think robots imagining the next step (in latent space) will be useful. It’s useful for people. A great way to validate that a robot is properly imagining the future is to make that latent space renderable in pixels. [1] “By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.” | ||
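For context, a heavily simplified sketch of the recipe that paper describes (the real components are a VAE, an MDN-RNN, and a small controller trained with CMA-ES; everything below is a toy stand-in): the compact policy sees only the world model's features, and the rollout can happen entirely in the model's own "dream".

```python
import torch
import torch.nn as nn

latent_dim, hidden_dim, action_dim = 32, 256, 3

# Stand-in for the VAE (V): only needed to map real frames into z when
# transferring the policy back to the actual environment.
encoder = nn.Linear(3 * 64 * 64, latent_dim)
dynamics = nn.GRUCell(latent_dim + action_dim, hidden_dim)    # stand-in for the MDN-RNN (M)
controller = nn.Linear(latent_dim + hidden_dim, action_dim)   # the compact policy (C)

z = torch.zeros(1, latent_dim)
h = torch.zeros(1, hidden_dim)
for t in range(50):                          # one "dreamed" rollout, no pixels required
    a = torch.tanh(controller(torch.cat([z, h], dim=-1)))
    h = dynamics(torch.cat([z, a], dim=-1), h)
    z = torch.randn(1, latent_dim)           # placeholder for M's predicted next latent
# Rendering z back to pixels (via V's decoder) is only needed to check,
# by eye, that the dream looks sane.
```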
| ▲ | abraxas 4 hours ago | parent | prev | next [-] | |
> if you wanted imagination you don't need to make a video model. You probably don't need to decode the latents at all. Soft disagree. What is the purpose of that imagination if not to map it to actual real-world outcomes? To compare those imagined outcomes to the real world, and possibly backpropagate through them, you'll need video frames. | ||
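Something like this toy sketch (shapes and modules are assumed, not taken from any particular system): decode the imagined latent into a frame, measure its error against the observed real frame, and backpropagate that pixel-space error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

decoder = nn.Linear(256, 3 * 64 * 64)           # toy latent-to-frame decoder
predicted_latent = torch.randn(1, 256, requires_grad=True)
real_frame = torch.rand(1, 3 * 64 * 64)         # the observed ground-truth frame

predicted_frame = torch.sigmoid(decoder(predicted_latent))
loss = F.mse_loss(predicted_frame, real_frame)  # compare imagination to reality
loss.backward()                                 # gradients flow back through the decoder
```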
| ▲ | ACCount37 6 hours ago | parent | prev | next [-] | |
If you train a video model, you by necessity train a world model for 3D worlds, which can then potentially be reused in robotics. I do wonder if I can frankenstein together a passable VLA using pretrained LTX-2 as a base. | ||
| ▲ | thegabriele 6 hours ago | parent | prev | next [-] | |
Sure, but at some point you want humans in the loop, I guess? | ||
| ▲ | koolala 6 hours ago | parent | prev | next [-] | |
What model do you need then, if you want a real-time 3D understanding of how realities work? Or are you focusing on "imagination" in some more abstract way? | ||
| ▲ | empath75 5 hours ago | parent | prev [-] | |
I am not sure we are at the "efficiency" phase of this. Even if you just wire this output (or, more likely, multiple instances running different counterfactuals) into a multimodal LLM that interprets the video and uses it to make decisions, you have something new. | ||
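Roughly this shape, with every name below made up (nothing here calls a real model or API): decode several counterfactual rollouts and let a multimodal LLM judge which imagined future best serves the goal.

```python
import random

def rollout_video(plan: str) -> str:
    """Placeholder: decode one counterfactual rollout of the world model to video."""
    return f"video_for_{plan}"

def score_with_multimodal_llm(video: str, goal: str) -> float:
    """Placeholder: a multimodal LLM rates how well the video achieves the goal."""
    return random.random()

candidate_plans = ["push_left", "push_right", "lift"]
goal = "move the box onto the shelf"
best_plan = max(candidate_plans,
                key=lambda p: score_with_multimodal_llm(rollout_video(p), goal))
```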