Remix.run Logo
adishj 3 hours ago

hey, thanks for the comment!

we've actually found that multimodal models are surprisingly good at maintaining temporal context as well

that being said, there's also a bunch of additional processing using more traditional CV / audio analysis we do to extract this information out as well (both frame-level and temporal) in your video understanding

for example, with the mean-motion analysis — you can see how subjects move over a period of time, which can help determine where important things are happening in the video, which ultimately can lead to better placements of edits.