echelon 7 hours ago
Image models are more fundamentally important at this stage than video models. Almost all of the control in image-to-video comes through an image, and image models still need a lot of work and innovation.

On a real physical movie set, think about all of the work that goes into setting the stage: the set dec, the makeup, the lighting, the framing, the blocking. All the work before calling "action". That's what image models do and must do in the starting frame. We can get far more influence out of manipulating images than video. There are lots of great video models and that space is highly competitive; we still have so much need on the image side.

When you do image-to-video, yes, you control evolution over time. But that direction actually has fewer degrees of freedom: you expect your actors or explosions to do certain reasonable things. Those 1024x1024xRGB pixels (or higher), by contrast, have far more degrees of freedom. Image models have more control surface area; you exercise control over more parameters. In video, staying on rails or on certain evolutionary paths is fine. Mistakes can be not just okay, they can be welcome.

It also makes sense that most of the work and iteration goes into generating images. It's a faster workflow with more immediate feedback and productivity. Video is expensive and takes much longer. Images are where the designer or director can influence more of the outcome, and quickly.

Image models still need far more stylistic control, pose control (not just ControlNets for limbs, but facial expressions, eyebrows, hair - everything), sets, props, consistent characters and locations and outfits. Text layout, fonts, kerning, logos, design elements, ...

We still don't have models that look as good as Midjourney. Midjourney is 100x more beautiful than anything else - it's like a magazine photoshoot or a dreamy Instagram feed. But it has the most lackluster and awful control of any model. It's a 2021-era model with 2030-level aesthetics. You can't place anything where you want it, you can't reuse elements, you can't have consistent sets... But it looks amazing.

Flux looks like plastic, Imagen looks cartoony, and OpenAI GPT Image looks sepia and stuck in the '90s. These models need to compete on aesthetics, control, and reproducibility. That's a lot of work. Video is a distraction from this work.
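To make the "control comes through the image" point concrete, here's a minimal sketch of a typical open image-to-video pipeline (Stable Video Diffusion via Hugging Face diffusers; the library, checkpoint name, and filenames are my assumptions, not anything from the comment above). The conditioning is essentially one image plus a couple of scalar motion knobs, so everything about composition, lighting, and character has to already be in that frame.

```python
# Sketch: image-to-video with Stable Video Diffusion via diffusers.
# Assumes `diffusers`, `torch`, and a CUDA GPU are available.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# The starting frame carries the set dressing, lighting, framing, and characters,
# i.e. nearly all of the degrees of freedom you can actually steer.
image = load_image("starting_frame.png").resize((1024, 576))

# The remaining knobs are just a handful of scalars governing how the scene evolves.
frames = pipe(
    image,
    num_frames=25,
    motion_bucket_id=127,     # how much motion to add
    noise_aug_strength=0.02,  # how far the video may drift from the conditioning frame
    fps=7,
).frames[0]

export_to_video(frames, "clip.mp4", fps=7)
```

Swap in any other image-to-video API and the shape is roughly the same: a single conditioning frame plus a few scalars, which is why iterating on the image buys so much more control than iterating on the video step.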
cubefox 6 hours ago
Hot take: text-to-image models should be biased toward photorealism. If I type in "a cat playing piano", I want to see something that looks like a 100% real cat playing a 100% real piano. Unless specified otherwise, a "cat" is trivially something that looks like an actual cat, and a real cat looks photorealistic. Not like a painting, or a cartoon, or a 3D render, or some fake almost-realistic-but-clearly-wrong "AI style".