antirez (6 hours ago):
Hi! This model is great, but it is too big for local inference. Whisper medium (the "base" model IMHO is not usable for most things, and "large" is too large) is a better deal for many environments, even if the transcription quality is noticeably lower (and even if it does not have a real online mode). But it's time for me to check the new Qwen 0.6 transcription model. If it works as well as their benchmarks claim, it could be the target for very serious optimizations and a no-dependencies inference chain designed from the start for CPU execution, not just for MPS, since you often want to install such transcription systems on servers rented online from Hetzner and similar vendors. So I'm going to look at it next, and if it delivers, it's really time for big optimizations targeting specifically the Intel, AMD, and ARM instruction sets, possibly also 8-bit quants if the performance remains good.
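For context on what CPU-side 8-bit quantization looks like, here is a minimal sketch in C, loosely modeled on ggml's Q8_0 layout. The block size of 32 and the per-block fp32 scale are assumptions for illustration, not a claim about how Qwen's or whisper.cpp's weights are actually stored:

    #include <math.h>
    #include <stdint.h>

    /* Illustrative blockwise 8-bit quantization: each block of 32 floats
       is stored as one fp32 scale plus 32 int8 values. */
    #define QBLOCK 32

    typedef struct {
        float  scale;      /* per-block scale: max |x| / 127 */
        int8_t q[QBLOCK];  /* quantized values */
    } q8_block;

    static void quantize_q8(const float *x, q8_block *out, int n_blocks) {
        for (int b = 0; b < n_blocks; b++) {
            float amax = 0.0f;
            for (int i = 0; i < QBLOCK; i++) {
                float v = fabsf(x[b * QBLOCK + i]);
                if (v > amax) amax = v;
            }
            float scale = amax / 127.0f;
            float inv   = scale != 0.0f ? 1.0f / scale : 0.0f;
            out[b].scale = scale;
            for (int i = 0; i < QBLOCK; i++)
                out[b].q[i] = (int8_t)lroundf(x[b * QBLOCK + i] * inv);
        }
    }

    /* Dot product of two quantized rows: the hot loop is pure int8
       multiply-accumulate, with one float rescale per block. */
    static float dot_q8(const q8_block *a, const q8_block *b, int n_blocks) {
        float sum = 0.0f;
        for (int blk = 0; blk < n_blocks; blk++) {
            int32_t acc = 0;
            for (int i = 0; i < QBLOCK; i++)
                acc += (int32_t)a[blk].q[i] * (int32_t)b[blk].q[i];
            sum += (float)acc * a[blk].scale * b[blk].scale;
        }
        return sum;
    }

The int8 inner loop of dot_q8 is the part worth vectorizing: it maps onto integer SIMD such as NEON's SDOT on ARM or the VNNI dot-product instructions on recent x86, which is roughly where the per-architecture optimization work would go.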
dust42 (6 hours ago), in reply:
Same experience here with Whisper: medium is often not good enough. The large-turbo model, however, is pretty decent, and on Apple silicon it is fast enough for real-time conversations. Passing the prompt parameter can also improve transcription quality, especially with domain-specific vocabulary. In general, whisper.cpp is better at transcribing full phrases than at streaming. And not to forget: for many use cases more than just English is needed. Unfortunately, most STT/ASR and TTS systems right now focus on English plus zero to ten other languages. Being able to add more languages or domain-specific vocabulary with reasonable effort would be a huge plus for any STT or TTS system.
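For reference, this is roughly how the initial prompt is passed through whisper.cpp's C API. A minimal sketch: the model path, language, and vocabulary list are placeholder assumptions, and real audio loading is elided in favor of a silent stand-in buffer:

    #include <stdio.h>
    #include "whisper.h"   /* whisper.cpp public header */

    int main(void) {
        /* Placeholder model path: any ggml Whisper model file. */
        struct whisper_context_params cparams = whisper_context_default_params();
        struct whisper_context *ctx = whisper_init_from_file_with_params(
            "models/ggml-large-v3-turbo.bin", cparams);
        if (!ctx) return 1;

        struct whisper_full_params wparams =
            whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
        wparams.language = "de";   /* non-English transcription */
        /* Bias decoding toward domain-specific terms (placeholder list). */
        wparams.initial_prompt = "Kubernetes, Istio, Prometheus, Helm chart";

        /* Stand-in audio: 1 s of 16 kHz mono silence. Real code would
           decode a WAV file or capture from a microphone here. */
        enum { N_SAMPLES = 16000 };
        static float pcm[N_SAMPLES];

        if (whisper_full(ctx, wparams, pcm, N_SAMPLES) == 0) {
            const int n = whisper_full_n_segments(ctx);
            for (int i = 0; i < n; i++)
                printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }

        whisper_free(ctx);
        return 0;
    }

If I remember right, the bundled CLI exposes the same field as a --prompt flag, so you can experiment with domain vocabulary without writing any C.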