gruez 3 days ago

> Limitations

> Timestamps/Speaker diarization. The model does not feature either of these.

What a shame. Is whisperx still the best choice if you want timestamps/diarization?

bartman 3 days ago | parent | next [-]

Even in the commercial space, there’s a lack of production grade ASR APIs that support diarization and word level timestamps.

My experiences with Google’s Chirp have been horrendous: it sometimes skips sections of speech entirely, hallucinates speech where the audio contains only noise, and returns unreliable word level timestamps. All of this happens even with their new audio prefiltering feature enabled.

AWS works slightly better, but also has trouble with keeping word level timestamps in sync.

Whisper is nice but hallucinates regularly.

OpenAI’s new transcription models are delivering accurate output but do not support word level timestamps…

A lot of this could be worked around by sending the resulting transcripts through a few layers of post processing, but… I just want to pay for an API that is reliable and saves me from doing all that work.

catlifeonmars 2 days ago | parent | next [-]

I wonder if you could run multiple models and average out the timestamps, kind of like how atomic clocks are used together and not separately
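
A minimal sketch of that idea, assuming every model returns the same word sequence (in practice the transcripts would first need to be aligned against each other, which is the hard part). The function name and the `(text, start, end)` tuple layout are made up for illustration. It uses the median rather than the mean so that one badly drifting model does not pull the result off:

```python
from statistics import median


def merge_word_timestamps(runs):
    """Combine per-word timestamps from several models.

    runs: one list per model, each a list of (text, start, end) tuples.
    Assumption: all runs contain the same words in the same order.
    """
    if not runs:
        return []
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must contain the same number of words")

    merged = []
    for candidates in zip(*runs):
        text = candidates[0][0]  # take the text from the first model
        start = median(word[1] for word in candidates)
        end = median(word[2] for word in candidates)
        merged.append((text, start, end))
    return merged


runs = [
    [("hi", 0.0, 0.5)],
    [("hi", 0.1, 0.6)],
    [("hi", 0.9, 0.7)],  # an outlier that the median ignores
]
print(merge_word_timestamps(runs))
```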

stavros 3 days ago | parent | prev [-]

Isn't Elevenlabs the best in this?

gardnr 2 days ago | parent | next [-]

They can have issues with the timestamps: https://github.com/elevenlabs/elevenlabs-python/issues/707

bartman 2 days ago | parent | prev [-]

I've not tested their speech-to-text yet, but based on the docs it looks promising. Thanks for the suggestion!

stavros 2 days ago | parent [-]

It's fantastic, and their diarization is spot on as well.

akreal 3 days ago | parent | prev | next [-]

WhisperX is not a model but a software package built around Whisper and some other models, including diarization and alignment ones. Something similar will be built around the Cohere Transcribe model, maybe even just an integration to WhisperX itself.

atoav 3 days ago | parent | prev | next [-]

I would try Qwen-ASR: https://qwen.ai/blog?id=qwen3asr

See the very bottom of the page for a transcription with timestamps.

mcbetz 2 days ago | parent | prev | next [-]

Mistral Voxtral has timestamps and diarization and does a good job for German (have not tested for other languages yet).

GaggiX 3 days ago | parent | prev | next [-]

There is also: https://github.com/linto-ai/whisper-timestamped

It doesn't use an extra model, so it supports every language that works with Whisper out of the box and uses less memory. It works by applying Dynamic Time Warping to the cross-attention weights.
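
To illustrate the general technique, here is a rough sketch of Dynamic Time Warping over a tokens x frames cost matrix. This is not whisper-timestamped's actual code: the function name and the toy matrix are hypothetical, and the cost is assumed to be something like the negated attention weight, so that strongly attended frames are cheap.

```python
def dtw_path(cost):
    """Return the cheapest monotonic path through a tokens x frames cost matrix.

    cost[i][j] is the cost of aligning token i with audio frame j.
    The result is a list of (token_index, frame_index) pairs.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] holds the cheapest cumulative cost of reaching cell (i-1, j-1).
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance token and frame
                acc[i - 1][j],      # advance token only
                acc[i][j - 1],      # advance frame only
            )

    # Walk back from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# Toy example: token 0 is cheap on frames 0-1, token 1 on frames 2-3.
toy_cost = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
]
print(dtw_path(toy_cost))
```

The frames assigned to each token give its start and end, once frame indices are converted to seconds.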

oezi 3 days ago | parent [-]

Just a warning that plain WhisperX is more accurate and Whisper-timestamped has many weird quirks.

stavros 2 days ago | parent | prev | next [-]

Diarization is done separately from ASR anyway (it's usually a separate pass, run after the ASR).
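
As a sketch of how the two outputs might be joined afterwards, assuming the ASR gives `(text, start, end)` words and the diarizer gives `(speaker, start, end)` turns (both layouts and the function name are invented here, not any particular library's format), each word can take the speaker whose turn overlaps it the most:

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it most.

    words: list of (text, start, end) from the ASR pass.
    turns: list of (speaker, start, end) from the diarization pass.
    Words that overlap no turn get None as their speaker.
    """
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((text, best_speaker))
    return labeled


words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.1, 1.3)]
turns = [("A", 0.0, 1.0), ("B", 1.0, 2.0)]
print(assign_speakers(words, turns))
```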

lifesaverluke 3 days ago | parent | prev [-]

For podcasts there is this https://news.ycombinator.com/item?id=47584376

angel- 2 days ago | parent [-]

The link doesn't work for me. Can you double-check it, please? Or tell me the name so I can look it up? Thanks!

satvikpendem 2 days ago | parent [-]

Enable showdead in your HN profile settings. The link works then, as it's a dead Show HN post.