jc4p | 4 days ago
Thank you so much!! While transcription is technically in the API, it isn't a native part of the model: it runs through Whisper separately. In my testing I often end up with a transcription in a different language than the one the user is speaking, and the current API has no way to force a language on that internal Whisper call. When the language is right, the exact text often isn't fully accurate, and when the text is accurate, it arrives slower than the audio output rather than in real time. All in all, not something I'd consider ready to ship in my app. What I've been thinking about is switching to a full audio in --> transcribe --> send to LLM --> TTS pipeline, which would let me show the exact input the model received, but that's a lot more work than a single OpenAI API call.
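For anyone curious, a minimal sketch of that pipeline using the standard OpenAI Python SDK might look like the following; the model names, voice, and file paths here are illustrative assumptions, not what the app actually uses:

    # Sketch of an audio in --> transcribe --> LLM --> TTS pipeline.
    from openai import OpenAI

    client = OpenAI()

    # 1. Transcribe the user's audio, pinning the language so the
    #    transcription model can't guess a different one.
    with open("user_input.wav", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",   # or "whisper-1" / "gpt-4o-mini-transcribe"
            file=audio_file,
            language="en",
        )

    # 2. Send the exact text (which can now be shown to the user) to the LLM.
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = reply.choices[0].message.content

    # 3. Synthesize the spoken response.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply_text,
    )
    with open("reply.mp3", "wb") as out:
        out.write(speech.content)

The trade-off, as noted above, is three round trips instead of one realtime call, which is where the extra latency and work come from.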
pbbakkum | 2 days ago
Heyo, I work on the Realtime API; this is a very cool app! For transcription I'd recommend trying the "gpt-4o-transcribe" or "gpt-4o-mini-transcribe" models, which are more accurate than "whisper-1". On any of these models you can set the language parameter; see the docs here: https://platform.openai.com/docs/api-reference/realtime-clie.... This doesn't guarantee ordering relative to the rest of the response, but the idea is to optimize for conversational-feeling latency. Hope this is helpful.
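A rough sketch of what that configuration looks like as a session.update event sent over the realtime WebSocket is below; the exact field names should be checked against the linked docs, and the model and language values are just examples:

    import json

    def transcription_session_update(model: str = "gpt-4o-transcribe",
                                     language: str = "en") -> str:
        # Build a session.update event that pins the transcription model
        # and forces a language (ISO-639-1 code) on the internal
        # transcription call.
        return json.dumps({
            "type": "session.update",
            "session": {
                "input_audio_transcription": {
                    "model": model,        # or "gpt-4o-mini-transcribe"
                    "language": language,
                },
            },
        })

    # Send over an already-open realtime WebSocket connection, e.g.:
    # await ws.send(transcription_session_update(language="en"))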
valleyer | 4 days ago
Ah yes, I've seen that occasionally too, but it hasn't been a big enough issue to block adoption in a non-productized tool. I actually implemented the STT -> LLM -> TTS pipeline as well, and I let users switch between the two approaches. It's far less interactive, but it gives much higher-quality responses. Best of luck!