▲ Show HN: Moonshine Open-Weights STT models – higher accuracy than Whisper Large v3 (github.com)
74 points by petewarden 3 hours ago | 12 comments
I wanted to share our new speech-to-text models, and the library to use them effectively. We're a small startup (six people, sub-$100k monthly GPU budget), so I'm proud of the work the team has done to create streaming STT models with lower word-error rates than OpenAI's largest Whisper model. Admittedly Large v3 is a couple of years old, but we're near the top of the HF OpenASR leaderboard, even up against Nvidia's Parakeet family. Anyway, I'd love to get feedback on the models and software, and hear about what people might build with it.
▲ aplomb1026 2 minutes ago | parent | next [-]
The streaming latency numbers are what stand out to me here. Accuracy benchmarks get all the attention, but for real-time applications (voice assistants, live captioning, in-call transcription), the tail latency matters more than shaving a few points off WER. A 58M param model that can stream with sub-second latency on a Raspberry Pi opens up a whole class of edge applications that just aren't practical with larger models, even if those larger models score higher on static benchmarks.
▲ Karrot_Kream 36 minutes ago | parent | prev | next [-]
According to the OpenASR Leaderboard [1], it looks like Parakeet V2/V3 and Canary-Qwen (a Qwen finetune) handily beat Moonshine. All 3 models are open, but Parakeet is the smallest of the 3. I use Parakeet V3 with Handy and it works great locally for me.
[1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
▲ ac29 an hour ago | parent | prev | next [-]
No idea why 'sudo pip install --break-system-packages moonshine-voice' is the recommended way to install on raspi? The authors do acknowledge this, though, and give a slightly too complex way to do it with uv in an example project (FYI, you don't need to source anything if you use uv run)
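A rough sketch of the uv route this comment alludes to (hypothetical command sequence, not the repo's actual instructions; assumes uv is installed and reuses the `moonshine-voice` package name from above, with `transcribe.py` as a placeholder script):

```shell
# Install without sudo or --break-system-packages, assuming uv is available.
uv venv                          # create a project-local .venv
uv pip install moonshine-voice   # install into .venv, not the system Python
uv run python transcribe.py      # uv run locates .venv itself -- no
                                 # `source .venv/bin/activate` step needed
```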
▲ asqueella 40 minutes ago | parent | prev | next [-]
For those wondering about language support: currently English, Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, and Vietnamese are available (most in Base size = 58M params)
▲ armcat an hour ago | parent | prev | next [-]
This is awesome, well done guys. I'm gonna try it as the ASR component in the local voice assistant I've been building: https://github.com/acatovic/ova. The tiny streaming latencies you show look insane
▲ pzo an hour ago | parent | prev | next [-]
Haven't tested it yet, but I'm wondering how it behaves with lots of IT jargon and tech acronyms. For that reason I had to mostly run an LLM after STT, but that slowed down Parakeet inference. Otherwise it sometimes had trouble detecting terms properly when talking about e.g. CoreML, int8, fp16, half float, ARKit, AVFoundation, ONNX, etc.
▲ g-mork an hour ago | parent | prev | next [-]
How does this compare to Parakeet, which runs wonderfully on CPU?
▲ sroussey 44 minutes ago | parent | prev | next [-]
ONNX models for the browser possible?
▲ lostmsu 2 hours ago | parent | prev | next [-]
How does it compare to Microsoft VibeVoice ASR? https://news.ycombinator.com/item?id=46732776
▲ cyanydeez 2 hours ago | parent | prev [-]
No LICENSE, no go