| ▲ | adishj 3 hours ago | |
hey, thanks for the comment! we've actually found that multimodal models are surprisingly good at maintaining temporal context as well that being said, there's also a bunch of additional processing using more traditional CV / audio analysis we do to extract this information out as well (both frame-level and temporal) in your video understanding for example, with the mean-motion analysis — you can see how subjects move over a period of time, which can help determine where important things are happening in the video, which ultimately can lead to better placements of edits. | ||