As an aside, has anyone else had some big hallucinations with the Gemini Meet summaries? I have been using it for a week or so and love the quality of the grammar in the summaries, but I have noticed two recurring problems: omitting what was actually the most important point raised, and hallucinating things like “person x suggested y do z” when, really, that is absolutely the last thing x would suggest!

▲

leetharris a year ago | parent | next [-]

The Google ASR is one of the worst on the internet. We run benchmarks of the entire industry regularly and the only hyperscaler with a good ASR is Azure. They acquired Nuance for $20b a while ago and they have a solid lead in the cloud space.

And to run it on a "free" product they probably use a very tiny, heavily quantized version of their already weak ASR.

There's lots and lots of better meeting bots if you don't mind paying or have low usage that works for a free tier. At Rev we give away something like 300 minutes a month.

▲

jll29 a year ago | parent | next [-]

Interesting. Do you have any peer reviewed scientific publications or technical reports regarding this work?

We also compared Amazon, Google, Microsoft Azure as well as a bunch of smaller players (from Edinburgh and Cambridge) and - consistent with what you reported - we also found Google ranked worst - but that was a one-off study from 2019 (unpublished) on financial news.

Word Error Rate (WER), the standard metric for the task, is not everything. For some applications, the ability to upload custom lexicons is paramount (ASR systems that are word-based (almost all) as opposed to phoneme-based require each word to be defined ahead of being able to recognize said word).
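
For anyone unfamiliar with the metric, here is a minimal sketch of how WER is usually computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` and the lowercase-and-split-on-whitespace normalization are my own assumptions for illustration; real benchmarks normalize punctuation, numbers and so on much more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: tokens are lowercased and split on whitespace only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# A made-up example of the lexicon problem: a single out-of-vocabulary
# name gets split into two ordinary words.
print(word_error_rate("buy ten shares of nvidia",
                      "buy ten shares of in video"))  # 0.4
```

The made-up example also shows why the number alone can mislead: only two word edits separate the two strings, yet the one term that mattered for the application is gone.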

▲

aftbit a year ago | parent | prev | next [-]

Are there any self-hosted options that are even remotely competitive? I have tried Whisper2 a fair bit, and it seems to work okay in very clean situations, like adding subtitles to movie dialog, but not so well when dealing with multiple speakers or poor audio quality.

▲

albertzeyer a year ago | parent [-]

K2/Kaldi is using more traditional ASR technology. It's probably more difficult to set up but you will get more reliable outputs (no hallucinations or so).

▲

baxtr a year ago | parent | prev | next [-]

Very interesting. Thanks for sharing.

Since you have experience in this, I’d like to hear your thoughts on a common assumption.

It goes like this: don’t build anything that would be a feature for a hyperscaler because ultimately they win.

I guess a lot of it is a question of timing?

▲

leetharris a year ago | parent | next [-]

I think it really depends on whether or not you can offer a competitive solution and what your end goals are. Do you want an indie hacker business, do you want a lifestyle business, do you want a big exit, do you want to go public, etc?

It is hard to compete with these hyperscalers because they use pseudo anti-competitive tactics that honestly should be illegal.

For example, I know some ASR providers have lost deals to GCP or AWS because those providers will basically throw in ASR for free if you sign up for X amount of EC2 or Y amount of S3, services that have absurd margins for the cloud providers.

Still, stuff like Supabase, Twilio, etc show there is a market. But it's likely shrinking as consolidation continues, exits slow, and the DOJ turns a blind eye to all of this.

▲

hackernewds a year ago | parent | prev [-]

Counter argument: Zoom, DocuSign

But you do have to be next to amazing at execution

▲

mst a year ago | parent [-]

I think those are cases of successfully becoming the company for the thing in the minds of decision makers before the hyperscalers decide to try and turn your product into a bundleable feature. Which is not to disagree with you, only to "yes, and" to emphasise that it's a fairly narrow path and 'amazing at execution' is necessary but not sufficient.

▲

depr a year ago | parent | prev [-]

Have you tested their new Chirp v2 model? Curious if there's any improvement there.

>the only hyperscaler with a good ASR is Azure

How would you say the non-hyperscalers compare? Speechmatics for example?

▲

hunter2_ a year ago | parent | prev [-]

It can simultaneously be [the last thing x would suggest] and [a conclusion that an uninvolved person tasked with summarizing might mistakenly draw, with slightly higher probability of making this mistake than not making it] and theoretically an LLM attempts to output the latter. The same exact principle applies to missing the most important point.