lostmsu | a day ago
This is very impressive. Any details about model architecture and size? Input and output representation? How does voice work? You mentioned Deepgram. Does it mean you do Speech-to-Text-to-Speech?
sid-the-kid | a day ago
For the input, we pass the model: 1) embedded audio and 2) a single image (encoded with a causal VAE). The model outputs the final RGB video directly. The key technical unlock was getting the model to generate video faster than real-time, which allows us to stream video directly to the user. We do this by recursively generating the video, always using the last few frames of the previous output to condition the next output. We have some tricks to make sure the video stays relatively stable even with recursion.
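A minimal sketch of what that recursive conditioning loop could look like (all function names, chunk sizes, and model interfaces here are illustrative assumptions, not their actual implementation):

```python
import numpy as np

# Illustrative stand-in for the video model: takes audio embeddings plus a few
# conditioning frames and returns a short chunk of RGB frames (T, H, W, 3).
def generate_chunk(audio_embed: np.ndarray, cond_frames: np.ndarray,
                   frames_per_chunk: int = 8) -> np.ndarray:
    h, w = cond_frames.shape[1:3]
    return np.random.rand(frames_per_chunk, h, w, 3).astype(np.float32)

def stream_video(first_image: np.ndarray, audio_embeds, context_frames: int = 4):
    """Recursively generate video: each chunk is conditioned on the last few
    frames of the previous chunk, so generation can continue indefinitely and
    each chunk can be streamed to the client as soon as it is ready."""
    cond = np.repeat(first_image[None], context_frames, axis=0)  # seed context from the single input image
    for audio_embed in audio_embeds:            # audio embeddings arrive as speech streams in
        chunk = generate_chunk(audio_embed, cond)
        yield chunk                             # stream this chunk immediately
        cond = chunk[-context_frames:]          # recursion: condition on the tail of the output

if __name__ == "__main__":
    img = np.zeros((256, 256, 3), dtype=np.float32)
    fake_audio = (np.zeros(128, dtype=np.float32) for _ in range(3))
    for i, chunk in enumerate(stream_video(img, fake_audio)):
        print(f"chunk {i}: {chunk.shape}")
```

As long as each chunk is generated faster than it takes to play back, the sliding-window conditioning lets the video run for an arbitrarily long conversation without re-encoding the whole history.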
lcolucci | a day ago
Thank you! We have an architecture diagram and some more details in the tech report here: https://lemonslice.com/live/technical-report. And yes, exactly: in between each character interaction we need to do speech-to-text, then an LLM, then text-to-speech, and then our video model. All of it happens in a continuously streaming pipeline.
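A deliberately simplified sketch of that per-turn flow (the stage functions below are placeholders, not the real APIs; in practice each stage streams its output into the next as it is produced rather than waiting for the full result):

```python
import asyncio

# Placeholder stages. The real system uses streaming services (e.g. Deepgram
# for speech-to-text); these signatures are assumptions for illustration only.
async def speech_to_text(audio: bytes) -> str:
    return "transcribed user speech"

async def llm_reply(transcript: str) -> str:
    return "character's reply text"

async def text_to_speech(text: str) -> bytes:
    return b"synthesized reply audio"

async def video_model(audio: bytes) -> bytes:
    return b"rgb video chunk conditioned on the reply audio"

async def handle_turn(user_audio: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS -> video,
    each stage feeding the next."""
    transcript = await speech_to_text(user_audio)
    reply_text = await llm_reply(transcript)
    reply_audio = await text_to_speech(reply_text)
    return await video_model(reply_audio)

if __name__ == "__main__":
    video = asyncio.run(handle_turn(b"user audio"))
    print(len(video), "bytes of video to stream back")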