valleyer 4 days ago

This is great! Well done.

I've used the realtime API for something similar (also related to practicing speaking, though not for foreign languages). I just wanted to comment that the realtime API will definitely give you the user's transcriptions -- they come back as a `server.conversation.item.input_audio_transcription.completed` event. I use it in my app for exactly that purpose.

jc4p 4 days ago | parent [-]

Thank you so much!! While the transcription is technically available in the API, it isn't a native part of the model; it runs through a separate Whisper call. In my testing, I often ended up with a transcription in a different language than the one the user was speaking, and the current API has no way to force a language on that internal Whisper call.

Even when the language is correct, the exact text often isn't 100% accurate; and when it is accurate, it arrives slower than the audio output, not in real time. All in all, not what I'd consider ready to ship in my app.

What I've been thinking about is switching to a full audio in --> transcribe --> send to LLM --> TTS pipeline, in which case I could show the user the exact input the model received, but that's far more work than a single OpenAI API call.
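The pipeline above can be sketched as three chained stages. The stage functions here are hypothetical placeholders; a real implementation would call a transcription API, a chat API, and a speech-synthesis API in their place.

```python
def pipeline(audio_in: bytes, transcribe, llm, tts):
    """Run audio through STT -> LLM -> TTS.

    transcribe: bytes -> str   (speech-to-text)
    llm:        str -> str     (text response)
    tts:        str -> bytes   (text-to-speech)

    Returns the reply audio plus the exact transcript the model saw,
    which is what makes showing the user's input possible.
    """
    transcript = transcribe(audio_in)  # the user can be shown exactly this text
    reply_text = llm(transcript)
    return tts(reply_text), transcript
```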

pbbakkum 2 days ago | parent | next [-]

Heyo, I work on the realtime API. This is a very cool app!

For transcription I'd recommend trying the "gpt-4o-transcribe" or "gpt-4o-mini-transcribe" models, which are more accurate than "whisper-1". On any model you can set the language parameter; see the docs here: https://platform.openai.com/docs/api-reference/realtime-clie.... This doesn't guarantee ordering relative to the rest of the response, but the idea is to optimize for conversational-feeling latency. Hope this is helpful.
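Putting that advice together, a hedged sketch of a `session.update` payload that pins both the transcription model and the language might look like this. The field names follow the realtime API's session config as I understand it; verify them against the linked docs before relying on this.

```python
# Assumed session.update payload: configure input-audio transcription
# with an explicit model and language, per the recommendation above.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",  # ISO-639-1 code; forces the transcription language
        },
    },
}
```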

valleyer 4 days ago | parent | prev [-]

Ah yes, I've seen that occasionally too, but it hasn't been a big enough issue for me to block adoption in a non-productized tool.

I actually implemented the STT -> LLM -> TTS pipeline, too, and I allow users to switch between them. It's far less interactive, but it also gives much higher quality responses.

Best of luck!