dfajgljsldkjag 13 hours ago

I am always skeptical of benchmarks that show perfect scores, especially when they come from the company selling the product. It feels like everyone claims to have solved conversational timing these days. I guess we will see if it is actually any good.

bpanahij 2 hours ago

You should be skeptical, and try it out. I selected 28 long conversations for our evaluation set, all unseen audio. Every turn-taking model makes tradeoffs, and I tried to make the best tradeoffs for each model by adjusting and tuning the implementations. As the creator of Sparrow, I’m certainly not in a position to be totally objective. However, we did use unaltered real conversational audio to evaluate. I tried to find examples that would challenge Sparrow-1 with lots of variation in speaker style across the conversations.

fudged71 13 hours ago

Different industry, but our marketing guy once said "You know what this [perfect] metric means? We can never use it in marketing because it's not believable"

khalic 12 hours ago

Just include some noise, it’s like the most available resource in the universe

drob518 7 hours ago

Never thought of noise as a resource, but yea.