wood_spirit 6 hours ago

As an aside, has anyone else had some big hallucinations with the Gemini meet summaries? Have been using it a week or so and loving the quality of the grammar of the summary etc, but noticed two recurring problems: omitting what was actually the most important point raised, and hallucinating things like “person x suggested y do z” when, really, that is absolutely the last thing x would really suggest!

leetharris 6 hours ago | parent | next [-]

The Google ASR is one of the worst on the internet. We run benchmarks of the entire industry regularly and the only hyperscaler with a good ASR is Azure. They acquired Nuance for $20b a while ago and they have a solid lead in the cloud space.

And to run it on a "free" product they probably use a very tiny, heavily quantized version of their already weak ASR.

There's lots and lots of better meeting bots if you don't mind paying or have low usage that works for a free tier. At Rev we give away something like 300 minutes a month.

jll29 an hour ago | parent | next [-]

Interesting. Do you have any peer reviewed scientific publications or technical reports regarding this work?

We also compared Amazon, Google, Microsoft Azure as well as a bunch of smaller players (from Edinburgh and Cambridge) and - consistent with what you reported - we also found Google ranked worst - but that was a one-off study from 2019 (unpublished) on financial news.

Word Error Rate (WER), the standard metric for the task, is not everything. For some applications, the ability to upload custom lexicons is paramount (ASR systems that are word-based (almost all), as opposed to phoneme-based, require each word to be defined ahead of time before they can recognize it).
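For readers unfamiliar with the metric, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The function name and the whitespace tokenization are assumptions made for illustration; real benchmarks normally normalize casing, punctuation and numerals first, and nothing here reflects the tooling used in the studies mentioned above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # A word missing from the recognizer's lexicon often comes out as
    # several wrong words, yet it is only a small part of the total score.
    print(word_error_rate("shares of nvidia rose", "shares of in video rose"))
```

This also illustrates the point about lexicons: a single misrecognized company name barely moves WER on a long transcript, even if that name is the only word the application cares about.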

baxtr 5 hours ago | parent | prev | next [-]

Very interesting. Thanks for sharing.

Since you have experience in this, I’d like to hear your thoughts on a common assumption.

It goes like this: don’t build anything that would be a feature for a hyperscaler because ultimately they win.

I guess a lot of it is a question of timing?

leetharris 4 hours ago | parent | next [-]

I think it really depends on whether or not you can offer a competitive solution and what your end goals are. Do you want an indie hacker business, do you want a lifestyle business, do you want a big exit, do you want to go public, etc?

It is hard to compete with these hyperscalers because they use pseudo anti-competitive tactics that honestly should be illegal.

For example, I know some ASR providers have lost deals to GCP or AWS because those providers will basically throw in ASR for free if you sign up for X amount of EC2 or Y amount of S3, services that have absurd margins for the cloud providers.

Still, stuff like Supabase, Twilio, etc show there is a market. But it's likely shrinking as consolidation continues, exits slow, and the DOJ turns a blind eye to all of this.

hackernewds an hour ago | parent | prev [-]

Counter argument: Zoom, DocuSign

But you do have to be next to amazing at execution

aftbit 3 hours ago | parent | prev [-]

Are there any self-hosted options that are even remotely competitive? I have tried Whisper2 a fair bit, and it seems to work okay in very clean situations, like adding subtitles to movie dialog, but not so well when dealing with multiple speakers or poor audio quality.

albertzeyer 3 hours ago | parent [-]

K2/Kaldi uses more traditional ASR technology. It's probably more difficult to set up, but you will get more reliable outputs (no hallucinations or the like).

hunter2_ 6 hours ago | parent | prev [-]

It can simultaneously be [the last thing x would suggest] and [a conclusion that an uninvolved person tasked with summarizing might mistakenly draw, with slightly higher probability of making this mistake than not making it] and theoretically an LLM attempts to output the latter. The same exact principle applies to missing the most important point.