sailingparrot 4 hours ago
> you don't need to make a video model. You probably don't need to decode the latents at all.

If you don't decode, how do you judge quality in a world where generative metrics are famously hard and imprecise? How do you integrate RLHF/RLAIF, which you can no longer skip if you want SotA results, into your pipeline if you don't decode? Just look at the companies that are explicitly aiming for robotics/simulation: they *are* doing video models.