| ▲ | Packing Input Frame Context in Next-Frame Prediction Models for Video Generation(lllyasviel.github.io) |
| 269 points by GaggiX 4 days ago | 28 comments |
| |
|
| ▲ | Jaxkr 4 days ago | parent | next [-] |
| This guy is a genius; for those who don’t know he also brought us ControlNet. This is the first decent video generation model that runs on consumer hardware. Big deal and I expect ControlNet pose support soon too. |
| |
| ▲ | artninja1988 3 days ago | parent | next [-] | | He also brought us IC-Light! I wonder why he's still contributing to open source... Surely all the big companies have made him huge offers. He's so talented | | |
| ▲ | dragonwriter 3 days ago | parent [-] | | I think he is working on his Ph.D. at Stanford. I assume whatever offers he has haven't been attractive enough to abandon that. Whether he’ll still be doing open work or get sucked into the bowels of some proprietary corporate behemoth afterwards remains to be seen, but I suspect he won't have trouble monetizing his skills either way. |
| |
| ▲ | msp26 3 days ago | parent | prev [-] | | I haven't bothered with video gen because I'm too impatient but isn't Wan pretty good too on regular hardware? | | |
| ▲ | dragonwriter 3 days ago | parent | next [-] | | Wan 2.1 (and Hunyuan and LTXV, in descending order of overall video quality, though each has unique strengths) work well—but slowly, except for LTXV—for short videos (single-digit seconds at their usual frame rates — 16 fps for Wan, 24 fps for LTXV, I forget for Hunyuan) on consumer hardware. But this blows them entirely out of the water on the length it can handle, so if it does so with coherence and quality across general prompts (especially if it is competitive with Wan and Hunyuan on trainability for concepts it may not handle normally), it is potentially a radical game changer. | | |
| ▲ | dragonwriter 3 days ago | parent [-] | | For completeness, I should note I'm talking about the 14B i2v and t2v Wan 2.1 models; there are others in the family, notably a set of 1.3B models that are presumably much faster, but I haven't worked with them as much. |
| |
| ▲ | dewarrn1 3 days ago | parent | prev | next [-] | | LTX-Video isn't quite the same quality as Wan, but the new distilled 0.9.6 version is pretty good and screamingly fast. https://github.com/Lightricks/LTX-Video | |
| ▲ | vunderba 3 days ago | parent | prev [-] | | Wan 2.1 is solid, but you start to get pretty bad continuity/drift issues when generating more than 81 frames (approx. 5 seconds of video), whereas FramePack lets you generate a minute or more. |
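A quick back-of-the-envelope check on the numbers in this subthread, since frame counts and clip lengths only relate through the frame rate. This is just arithmetic; the frame rates are the ones quoted above, and actual model defaults may differ.

    # Rough arithmetic for the figures quoted above (frame rates are the ones
    # mentioned in this thread; actual model defaults may differ).

    def duration_seconds(num_frames: int, fps: float) -> float:
        """Clip length implied by a frame count at a given frame rate."""
        return num_frames / fps

    def frames_needed(seconds: float, fps: float) -> int:
        """Frame count needed to reach a target duration at a given frame rate."""
        return round(seconds * fps)

    print(duration_seconds(81, 16))  # Wan 2.1's 81 frames at ~16 fps -> ~5.06 s
    print(frames_needed(60, 16))     # a one-minute clip at 16 fps -> 960 frames,
                                     # the regime FramePack is aiming at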
|
|
|
| ▲ | IshKebab 4 days ago | parent | prev | next [-] |
Funny how it really wants people to dance. Even the guy sitting down for an interview just starts dancing while still seated. |
| |
| ▲ | jonas21 3 days ago | parent | next [-] | | Presumably they're dancing because it's in the prompt. You could change the prompt to have them do something else (but that would be less fun!) | | |
| ▲ | IshKebab 3 days ago | parent [-] | | I'm no expert but are you sure there is a prompt? | | |
| ▲ | dragonwriter 3 days ago | parent [-] | | Yes, while the page here does not directly mention the prompts, the linked paper does, and the linked code repo shows that prompts are used as well. | | |
| ▲ | vunderba 3 days ago | parent | next [-] | | 100%. I don't think I've ever even come across an I2V model that didn't require at least a positive prompt. Some people get around it by integrating a vision LLM into their ComfyUI workflows however. | |
| ▲ | IshKebab 3 days ago | parent | prev [-] | | Ah yeah you're right - they seem to just really like giving dancing prompts. I guess they work well due to the training set. |
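To make the point about prompts concrete, here is the general shape of an image-to-video call: a conditioning image plus a positive text prompt. This is only a hedged sketch; the pipeline object and its keyword arguments are hypothetical stand-ins, not the actual FramePack, Wan, or ComfyUI API.

    # Hypothetical sketch: `pipe` stands in for whichever image-to-video
    # pipeline you actually run (FramePack, Wan 2.1, LTX-Video, ...), and the
    # keyword arguments are illustrative. The point is only that both a still
    # image and a text prompt are supplied; the dancing in the demos comes
    # from the prompt, not from the model's preferences.
    from PIL import Image

    def generate_clip(pipe, image_path: str, prompt: str, num_frames: int = 81):
        """Animate a still image according to a positive text prompt."""
        image = Image.open(image_path).convert("RGB")
        result = pipe(
            image=image,                       # the frame to animate
            prompt=prompt,                     # e.g. "the man dances energetically"
            negative_prompt="static, blurry",  # optional, pipeline-dependent
            num_frames=num_frames,             # clip length in frames
        )
        return result.frames                   # generated frames, ready to encode

The vision-LLM trick mentioned above just automates the prompt: caption the input image first, then pass that caption in as the prompt.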
|
|
| |
| ▲ | Jaxkr 4 days ago | parent | prev | next [-] | | There's a massive open TikTok training set that lots of video researchers use. | |
| ▲ | bravura 3 days ago | parent | prev [-] | | It's a peculiar and fascinating observation you make. With static images, we always look for eyes. With video, we always look for dancing. |
|
|
| ▲ | ZeroCool2u 4 days ago | parent | prev | next [-] |
| Wow, the examples are fairly impressive and the resources used to create them are practically trivial. Seems like inference can be run on previous generation consumer hardware. I'd like to see throughput stats for inference on a 5090 too at some point. |
|
| ▲ | WithinReason 4 days ago | parent | prev | next [-] |
| Could you do this spatially as well? E.g. generate the image top-down instead of all at once |
| |
|
| ▲ | modeless 4 days ago | parent | prev | next [-] |
| Could this be used for video interpolation instead of extrapolation? |
| |
| ▲ | yorwba 3 days ago | parent [-] | | Their "inverted anti-drifting" basically amounts to first extrapolating a lot and then interpolating backwards. |
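For readers who haven't opened the paper, a rough sketch of that sampling order (sample_section is a hypothetical stand-in for one diffusion-sampling call, not the repo's actual function): sections are generated from the end of the video back toward the known first frame, so only the very first step is pure extrapolation and every later step interpolates between fixed anchors.

    # Schematic of "inverted anti-drifting" sampling as summarized above.
    # `sample_section` is a hypothetical stand-in for a single diffusion call.

    def generate_video(first_frame, num_sections, sample_section):
        sections = [None] * num_sections
        for idx in reversed(range(num_sections)):
            # Context = the user's first frame plus any later, already-generated
            # sections; only idx == num_sections - 1 has no future context yet.
            future = [s for s in sections[idx + 1:] if s is not None]
            sections[idx] = sample_section(
                anchor=first_frame,  # clean, known start of the video
                future=future,       # previously generated later chunks
                position=idx,        # where this chunk sits on the timeline
            )
        return sections              # earliest-to-latest; concatenate for the clip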
|
|
| ▲ | ilaksh 3 days ago | parent | prev | next [-] |
| Amazing.
If you have more RAM or something, can it go faster? Can you get even more speed on an H100 or H200? |
|
| ▲ | fregocap 4 days ago | parent | prev [-] |
| looks like the only motion it can do...is to dance |
| |
| ▲ | jsolson 3 days ago | parent | next [-] | | It can dance if it wants to... It can leave LLMs behind... 'Cause LLMs don't dance, and if they don't dance, well, they're no friends of mine. | | |
| ▲ | MyOutfitIsVague 3 days ago | parent | next [-] | | The AI Safety dance? | |
| ▲ | rhdunn 3 days ago | parent | prev [-] | | That's a certified bop! ;) You should get elybeatmaker to do a remix! Edit: I didn't realize that this was actually a reference to Men Without Hats - The Safety Dance. I was referencing a different parody/allusion to that song! |
| |
| ▲ | dragonwriter 3 days ago | parent | prev | next [-] | | There is plenty of non-dance motion (only one or two examples have non-dance foot motion, but feet aren't the only things that move). | |
| ▲ | enlyth 3 days ago | parent | prev [-] | | It takes a text prompt along with the image input, dancing is presumably what they've used for the examples |
|