HanClinto (4 hours ago):
I absolutely love your "expert tools" approach. If I understand correctly, you aren't just feeding a video into a multimodal LLM and asking it "what is the bounding box of the optimal caption region?" -- you've built tools around discrete algorithms (traditional CV techniques like object detection and motion analysis) that give "expert opinions" to the LLM in the form of tool calls, such as finding the regions of minimal saliency and minimal movement as the best places for captions. When the LLM needs to place captions, it calls one of these expert discrete-algorithm tools to determine where they should go -- you aren't just asking the LLM to do it on its own.

If I'm right about that, then I absolutely applaud you -- it feels like THIS is a fantastic model for how agentic tools should be built, and it's the absolute opposite of AI slop. Kudos!
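(For readers curious what such an expert tool could look like: below is a minimal sketch, assuming OpenCV with the contrib saliency module. The function name, window sizes, and the simple frame-differencing motion score are all illustrative assumptions, not the actual implementation being discussed -- it just demonstrates the "minimal saliency + minimal movement" idea as a discrete algorithm the LLM could call.)

```python
# Hypothetical "expert tool" for caption placement: combine a static
# saliency map with frame-differencing motion into a per-pixel cost,
# then return the caption-sized window with the lowest mean cost.
# Requires opencv-contrib-python for cv2.saliency.
import cv2
import numpy as np

def best_caption_region(frames, box_w=400, box_h=80):
    """Return (x, y) of the top-left corner of the box_w x box_h window
    with the lowest combined saliency + motion across the given frames."""
    saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
    h, w = frames[0].shape[:2]
    cost = np.zeros((h, w), dtype=np.float32)
    prev_gray = None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        ok, sal = saliency.computeSaliency(frame)  # visual saliency in [0, 1]
        if ok:
            cost += sal.astype(np.float32)
        if prev_gray is not None:
            # Crude motion estimate: per-pixel frame differencing.
            cost += cv2.absdiff(gray, prev_gray).astype(np.float32) / 255.0
        prev_gray = gray
    # Mean cost over every box_w x box_h window (boxFilter normalizes by
    # default), restricted to windows fully inside the frame.
    window_mean = cv2.boxFilter(cost, ddepth=-1, ksize=(box_w, box_h))
    x0, y0 = box_w // 2, box_h // 2
    valid = np.full_like(window_mean, np.inf)
    valid[y0:h - y0, x0:w - x0] = window_mean[y0:h - y0, x0:w - x0]
    _, _, min_loc, _ = cv2.minMaxLoc(valid)  # min_loc = (x, y) window center
    return min_loc[0] - x0, min_loc[1] - y0
```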
adishj (3 hours ago, in reply):
Thanks for the comment, that's exactly right. We're using a mix of out-of-the-box multimodal AI capability and traditional audio/video analysis techniques as part of our video understanding pipeline, all of which becomes context for the agent to use during its editing process.
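(A hedged sketch of how analysis results like the one above might be exposed to the agent, written in the JSON-schema style common to most LLM tool-calling APIs. The tool name and parameters here are invented for illustration and are not the actual interface described in this thread.)

```python
# Hypothetical tool registration: the agent never computes placement
# itself; it calls the expert tool and works with the returned region.
FIND_CAPTION_REGION_TOOL = {
    "type": "function",
    "function": {
        "name": "find_caption_region",
        "description": "Return the least salient, least moving region "
                       "of the clip, suitable for caption placement.",
        "parameters": {
            "type": "object",
            "properties": {
                "start_s": {"type": "number", "description": "clip start, seconds"},
                "end_s": {"type": "number", "description": "clip end, seconds"},
                "box_w": {"type": "integer", "description": "caption width, px"},
                "box_h": {"type": "integer", "description": "caption height, px"},
            },
            "required": ["start_s", "end_s"],
        },
    },
}
```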