Remix clone Hacker News

	▲	danishSuri1994 5 hours ago
		Really interesting direction. The node-based canvas feels like a more scalable abstraction for video automation than the usual chat-only interface. I’m curious how you’re handling long-form content where temporal context matters (e.g., emotional shifts, pacing, narrative cues). Multimodal models are good at frame-level recognition, but editing requires understanding relationships between scenes. Have you found any methods that work reliably there?
	▲	adishj 3 hours ago
		hey, thanks for the comment! we've actually found that multimodal models are surprisingly good at maintaining temporal context as well. that being said, we also do a bunch of additional processing with more traditional CV / audio analysis to extract this information (both frame-level and temporal). in our video understanding, for example, the mean-motion analysis shows how subjects move over a period of time, which can help determine where important things are happening in the video and ultimately lead to better edit placement.
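
For anyone wondering what a mean-motion pass could look like, here is a minimal sketch. It assumes plain frame differencing on grayscale frames, which may not be what the pipeline described above actually does. The names `mean_motion` and `high_motion_spans` and the threshold value are hypothetical, chosen only for illustration.

```python
import numpy as np


def mean_motion(frames: np.ndarray) -> np.ndarray:
    """Per-transition motion score for a clip.

    frames: array of shape (T, H, W), grayscale.
    Returns an array of shape (T - 1,), where entry i is the mean
    absolute pixel difference between frame i and frame i + 1.
    """
    frames = frames.astype(np.float32)  # avoid uint8 wraparound on subtraction
    diffs = np.abs(frames[1:] - frames[:-1])
    return diffs.mean(axis=(1, 2))


def high_motion_spans(scores: np.ndarray, threshold: float) -> list[tuple[int, int]]:
    """Group consecutive scores above `threshold` into (start, end) spans.

    `end` is exclusive. Indices refer to frame transitions, not frames.
    """
    spans = []
    start = None
    for i, is_active in enumerate(scores > threshold):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(scores)))
    return spans


# Synthetic clip (assumption: stands in for decoded video frames):
# six dark 4x4 frames with one bright frame in the middle.
clip = np.zeros((6, 4, 4), dtype=np.uint8)
clip[3] = 10

scores = mean_motion(clip)
spans = high_motion_spans(scores, threshold=5.0)
print(scores.tolist())
print(spans)
```

The spans mark where the picture changes the most, so they are candidate regions for placing an edit. A real system would likely use something more robust to camera movement, such as optical flow, but the idea of turning motion over time into a per-timestep signal is the same.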