Show HN: Multimodal perception system for real-time conversation (raven.tavuslabs.org)
24 points by mert_gerdan 3 hours ago | 2 comments
I work on real-time voice/video AI at Tavus, and for the past few years I've mostly focused on how machines respond in a conversation. One thing that's always bothered me is that almost all conversational systems still reduce everything to transcripts, throwing away a ton of signals that downstream systems could use. Some existing emotion-understanding models try to classify what they see into a small set of arbitrary boxes, but they aren't fast or rich enough to do this with conviction in real time.

So I built a multimodal perception system that encodes visual and audio conversational signals and translates them into natural language by aligning a small LLM on those signals. The agent can "see" and "hear" you, and you interface with it via an OpenAI-compatible tool schema in a live conversation. It outputs short natural-language descriptions of what's going on in the interaction: things like uncertainty building, sarcasm, disengagement, or even a shift in attention within a single turn.

Some quick specs:

- Runs in real time per conversation
- Processes ~15fps video plus overlapping audio alongside the conversation
- Handles nuanced emotions, whispers vs. shouts
- Trained on synthetic + internal convo data

Happy to answer questions or go deeper on architecture/tradeoffs.

More details here: https://www.tavus.io/post/raven-1-bringing-emotional-intelli...
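To make the tool-schema interface concrete, here's a minimal sketch of what consuming the perception output can look like from the client side with the OpenAI Python SDK. The tool name (`get_perception_state`), its parameters, and the `fetch_perception_summary` stub are illustrative assumptions, not the exact production schema:

```python
# Hedged sketch: expose a perception summary to the model as an
# OpenAI-compatible function tool. Names and fields are hypothetical.
import json
from openai import OpenAI

# Tool definition the model can call to pull the latest perception summary,
# i.e. a short natural-language description of the user's state.
PERCEPTION_TOOL = {
    "type": "function",
    "function": {
        "name": "get_perception_state",  # hypothetical name
        "description": (
            "Return a short natural-language description of the user's "
            "current visual/audio state, e.g. 'uncertainty building' or "
            "'attention shifting away'."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "window_seconds": {
                    "type": "number",
                    "description": "How far back to summarize, in seconds.",
                }
            },
            "required": [],
        },
    },
}


def fetch_perception_summary(window_seconds: float = 5.0) -> str:
    """Stand-in for the call into the live perception session."""
    return "User is leaning back, voice flattening; engagement dropping."


client = OpenAI()
messages = [{"role": "user", "content": "Walk me through the pricing page."}]
resp = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages, tools=[PERCEPTION_TOOL]
)

choice = resp.choices[0]
if choice.message.tool_calls:
    messages.append(choice.message)  # keep the assistant's tool-call turn
    for call in choice.message.tool_calls:
        args = json.loads(call.function.arguments or "{}")
        summary = fetch_perception_summary(args.get("window_seconds", 5.0))
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": summary}
        )
    # Second turn: the model answers with awareness of the user's state.
    final = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=[PERCEPTION_TOOL]
    )
    print(final.choices[0].message.content)
```

The appeal of this shape is that perception stays a side channel: the model pulls a short description of the user's state when it needs one, rather than having raw frames or audio stuffed into the prompt.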
ashishheda 9 minutes ago:
Wooo, this is amazing! | ||
jesserowe 3 hours ago:
the demo is wild... kudos | ||