Take a look at V-JEPA (Video Joint Embedding Predictive Architecture), SAM (Segment Anything Model), etc. for Meta's latest research.
https://ai.meta.com/vjepa/
https://ai.meta.com/sam2/
https://ai.meta.com/research/