lostmsu | a day ago
This is very impressive. Any details about model architecture and size? Input and output representation? How does voice work? You mentioned Deepgram. Does it mean you do Speech-to-Text-to-Speech?
sid-the-kid | a day ago
For the input, we pass the model: 1) embedded audio and 2) a single image (encoded with a causal VAE). The model outputs the final RGB video directly. The key technical unlock was getting the model to generate video faster than real-time, which allows us to stream video directly to the user. We do this by recursively generating the video, always using the last few frames of the previous output to condition the next output. We have some tricks to make sure the video stays relatively stable even with recursion.
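A minimal sketch of what that recursive conditioning loop could look like (all function names, chunk sizes, and model interfaces here are illustrative assumptions, not their actual implementation):

```python
import numpy as np

# Illustrative stand-in for the video model: takes audio embeddings plus a few
# conditioning frames and returns a short chunk of RGB frames (T, H, W, 3).
def generate_chunk(audio_embed: np.ndarray, cond_frames: np.ndarray,
                   frames_per_chunk: int = 8) -> np.ndarray:
    h, w = cond_frames.shape[1:3]
    return np.random.rand(frames_per_chunk, h, w, 3).astype(np.float32)

def stream_video(first_image: np.ndarray, audio_embeds, context_frames: int = 4):
    """Recursively generate video: each chunk is conditioned on the last few
    frames of the previous chunk, so generation can continue indefinitely and
    each chunk can be streamed to the client as soon as it is ready."""
    cond = np.repeat(first_image[None], context_frames, axis=0)  # seed context from the single input image
    for audio_embed in audio_embeds:            # audio embeddings arrive as speech streams in
        chunk = generate_chunk(audio_embed, cond)
        yield chunk                             # stream this chunk immediately
        cond = chunk[-context_frames:]          # recursion: condition on the tail of the output

if __name__ == "__main__":
    img = np.zeros((256, 256, 3), dtype=np.float32)
    fake_audio = (np.zeros(128, dtype=np.float32) for _ in range(3))
    for i, chunk in enumerate(stream_video(img, fake_audio)):
        print(f"chunk {i}: {chunk.shape}")
```

As long as each chunk is generated faster than it takes to play back, the sliding-window conditioning lets the video run for an arbitrarily long conversation without re-encoding the whole history.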
lcolucci | a day ago
Thank you! We have an architecture diagram and some more details in the tech report here: https://lemonslice.com/live/technical-report. And yes, exactly: in between each character interaction we need to do speech-to-text, then an LLM, then text-to-speech, and then our video model. All of it happens in a continuously streaming pipeline.
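A deliberately simplified sketch of that per-turn flow (the stage functions below are placeholders, not the real APIs; in practice each stage streams its output into the next as it is produced rather than waiting for the full result):

```python
import asyncio

# Placeholder stages. The real system uses streaming services (e.g. Deepgram
# for speech-to-text); these signatures are assumptions for illustration only.
async def speech_to_text(audio: bytes) -> str:
    return "transcribed user speech"

async def llm_reply(transcript: str) -> str:
    return "character's reply text"

async def text_to_speech(text: str) -> bytes:
    return b"synthesized reply audio"

async def video_model(audio: bytes) -> bytes:
    return b"rgb video chunk conditioned on the reply audio"

async def handle_turn(user_audio: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS -> video,
    each stage feeding the next."""
    transcript = await speech_to_text(user_audio)
    reply_text = await llm_reply(transcript)
    reply_audio = await text_to_speech(reply_text)
    return await video_model(reply_audio)

if __name__ == "__main__":
    video = asyncio.run(handle_turn(b"user audio"))
    print(len(video), "bytes of video to stream back")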