HanClinto (4 hours ago):
I absolutely love your "expert tools" approach. If I understand correctly, you aren't just feeding a video into a multimodal LLM and asking it "what is the bounding box of the optimal caption region?" -- you've built tools around discrete algorithms (traditional CV techniques like object detection and motion analysis) that give "expert opinions" to the LLM in the form of tool calls, such as finding the regions of minimal saliency and minimal movement as the best places for captions. When the LLM needs to place captions, it calls one of these expert discrete-algorithm tools to determine where they should go -- you aren't just asking the LLM to do it on its own.

If I'm right about that, then I absolutely applaud you -- it feels like THIS is a fantastic model for how agentic tools should be built, and it's the absolute opposite of AI slop. Kudos!
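(For readers curious what such an expert tool could look like: below is a minimal sketch, assuming OpenCV with the contrib saliency module. The function name, window sizes, and the simple frame-differencing motion score are all illustrative assumptions, not the actual implementation being discussed -- it just demonstrates the "minimal saliency + minimal movement" idea as a discrete algorithm the LLM could call.)

```python
# Hypothetical "expert tool" for caption placement: combine a static
# saliency map with frame-differencing motion into a per-pixel cost,
# then return the caption-sized window with the lowest mean cost.
# Requires opencv-contrib-python for cv2.saliency.
import cv2
import numpy as np

def best_caption_region(frames, box_w=400, box_h=80):
    """Return (x, y) of the top-left corner of the box_w x box_h window
    with the lowest combined saliency + motion across the given frames."""
    saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
    h, w = frames[0].shape[:2]
    cost = np.zeros((h, w), dtype=np.float32)
    prev_gray = None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        ok, sal = saliency.computeSaliency(frame)  # visual saliency in [0, 1]
        if ok:
            cost += sal.astype(np.float32)
        if prev_gray is not None:
            # Crude motion estimate: per-pixel frame differencing.
            cost += cv2.absdiff(gray, prev_gray).astype(np.float32) / 255.0
        prev_gray = gray
    # Mean cost over every box_w x box_h window (boxFilter normalizes by
    # default), restricted to windows fully inside the frame.
    window_mean = cv2.boxFilter(cost, ddepth=-1, ksize=(box_w, box_h))
    x0, y0 = box_w // 2, box_h // 2
    valid = np.full_like(window_mean, np.inf)
    valid[y0:h - y0, x0:w - x0] = window_mean[y0:h - y0, x0:w - x0]
    _, _, min_loc, _ = cv2.minMaxLoc(valid)  # min_loc = (x, y) window center
    return min_loc[0] - x0, min_loc[1] - y0
```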
adishj (3 hours ago, in reply):
Thanks for the comment, that's exactly right. We're using a mix of out-of-the-box multimodal AI capability and traditional audio/video analysis techniques as part of our video understanding pipeline, all of which becomes context for the agent to use during its editing process.
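(A hedged sketch of how analysis results like the one above might be exposed to the agent, written in the JSON-schema style common to most LLM tool-calling APIs. The tool name and parameters here are invented for illustration and are not the actual interface described in this thread.)

```python
# Hypothetical tool registration: the agent never computes placement
# itself; it calls the expert tool and works with the returned region.
FIND_CAPTION_REGION_TOOL = {
    "type": "function",
    "function": {
        "name": "find_caption_region",
        "description": "Return the least salient, least moving region "
                       "of the clip, suitable for caption placement.",
        "parameters": {
            "type": "object",
            "properties": {
                "start_s": {"type": "number", "description": "clip start, seconds"},
                "end_s": {"type": "number", "description": "clip end, seconds"},
                "box_w": {"type": "integer", "description": "caption width, px"},
                "box_h": {"type": "integer", "description": "caption height, px"},
            },
            "required": ["start_s", "end_s"],
        },
    },
}
```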