Damn, you beat me to it. I was building something similar but got too caught up optimizing the context extraction. I actually ended up building a full spec for it—basically a PoC of "grep for videos."

My end goal was to let an agent make semantic changes (e.g., "remove the parts where the guy in the blue dress is seen") by simply grepping the context spec for the relevant timestamps and using ffmpeg to cut them out.
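Roughly, the flow could look like the sketch below. To be clear, the spec format here (one line per segment: start seconds, end seconds, description) is made up for illustration, as are the helper names and file names; only the ffmpeg `select`/`aselect` filters with `between(t, ...)` are real. It greps the spec for matching segments, inverts them into the ranges to keep, and builds the ffmpeg command without running it.

```python
import re

# Hypothetical context spec: "start_sec end_sec description", one segment per line.
SPEC = """\
0.0 4.2 wide shot of an empty street
4.2 9.8 guy in the blue dress walks into frame
9.8 15.0 close-up of a shop window
15.0 18.5 guy in the blue dress waves at the camera
18.5 24.0 street empties out again
"""


def grep_spec(spec, pattern):
    """Return (start, end) for every segment whose description matches pattern."""
    hits = []
    for line in spec.splitlines():
        start, end, desc = line.split(maxsplit=2)
        if re.search(pattern, desc, re.IGNORECASE):
            hits.append((float(start), float(end)))
    return hits


def keep_ranges(cuts, duration):
    """Invert the ranges to cut into the ranges to keep, merging overlaps."""
    keep, cursor = [], 0.0
    for start, end in sorted(cuts):
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        keep.append((cursor, duration))
    return keep


def ffmpeg_cut_command(src, dst, keep):
    """Build (but do not run) an ffmpeg command that keeps only the given ranges."""
    expr = "+".join(f"between(t,{s},{e})" for s, e in keep)
    return [
        "ffmpeg", "-i", src,
        # Drop unselected video frames, then regenerate timestamps.
        "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",
        # Same for audio so it stays in sync with the video.
        "-af", f"aselect='{expr}',asetpts=N/SR/TB",
        dst,
    ]


cuts = grep_spec(SPEC, r"blue dress")
keep = keep_ranges(cuts, duration=24.0)
cmd = ffmpeg_cut_command("input.mp4", "output.mp4", keep)
```

Passing `cmd` to `subprocess.run` would do the actual edit. The filter approach re-encodes the video; cutting on keyframes with stream copy would be faster but less precise at the boundaries.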

How are you extracting context from videos?

▲

adishj 5 hours ago | parent [-]

How would this be different from vector embeddings / semantic search?

▲

shambu2k 3 hours ago | parent [-]

Vector embeddings are fuzzy when it comes to finding boundaries. With my spec approach, the goal is to get precise start/end times that ffmpeg can use for edits. The downside is that my approach needs a lot of pre-processing of the raw footage. Vectors win on zero-shot flexibility here.

▲

adishj 3 hours ago | parent [-]

If you have an example you could share, I'd be very curious to see what you mean.