| ▲ | sim04ful 3 days ago |
| The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g., thinking, structured outputs) from outputs meant to be heard by the end user. I'm curious how others have solved this. |
|
| ▲ | artur44 3 days ago | parent [-] |
| A simple way is to split the model’s output stream before TTS.
Reasoning/structured tokens go into one bucket, actual user-facing text into another. Only the second bucket is synthesized. Most "thinking out loud" issues come from feeding the whole stream directly into audio. |
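For anyone who wants to try the two-bucket split, here is a minimal sketch. It assumes the model emits a text stream in which reasoning is wrapped in `<think>`/`</think>` delimiters; the function name `split_stream` and the bucket labels `"internal"` and `"spoken"` are made up for illustration and do not come from any particular model or SDK. The only subtle part is that a delimiter can be cut in half by a chunk boundary, so the tail of each chunk is held back when it could be the start of a tag.

```python
from typing import Iterable, Iterator, Tuple

# Assumed delimiters; a given model may use different markers or none at all.
OPEN, CLOSE = "<think>", "</think>"


def split_stream(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, text) pairs, where bucket is "internal" or "spoken"."""
    buf = ""
    inside = False  # True while we are between OPEN and CLOSE
    for chunk in chunks:
        buf += chunk
        while True:
            tag = CLOSE if inside else OPEN
            idx = buf.find(tag)
            if idx != -1:
                # A complete tag is in the buffer: flush what precedes it,
                # drop the tag itself, and switch buckets.
                if idx:
                    yield ("internal" if inside else "spoken", buf[:idx])
                buf = buf[idx + len(tag):]
                inside = not inside
                continue
            # No complete tag: hold back any tail that might be a tag prefix.
            keep = 0
            for k in range(min(len(tag) - 1, len(buf)), 0, -1):
                if buf.endswith(tag[:k]):
                    keep = k
                    break
            emit = buf[:len(buf) - keep]
            if emit:
                yield ("internal" if inside else "spoken", emit)
            buf = buf[len(buf) - keep:]
            break
    if buf:
        # Stream ended with a dangling partial tag; treat it as plain text.
        yield ("internal" if inside else "spoken", buf)


def spoken_only(chunks: Iterable[str]) -> str:
    """Convenience helper: the text that would be handed to TTS."""
    return "".join(text for bucket, text in split_stream(chunks) if bucket == "spoken")
```

In use, only the `"spoken"` pieces would be forwarded to the synthesizer, while `"internal"` pieces go to logs or a structured-output parser. This only covers the TTS-after-text case; it says nothing about models that emit audio tokens directly.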
| |
| ▲ | pugio 3 days ago | parent [-] | | There is no TTS here. It's a native audio output model which outputs audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.) | | |
| ▲ | artur44 3 days ago | parent [-] | | True, but even with native audio-token models you still need to split the model’s output channels. Reasoning/internal tokens shouldn't go into the audio stream; only user-facing content should be emitted as audio. The principle is the same, whether the last step is TTS or audio token generation. | | |
| ▲ | regularfry 2 days ago | parent [-] | | There's an assumption there that the audio stream contains an equivalent of the <think>/</think> tokens. Every reason to think it should, but without seeing the tokeniser config it's a bit of a guess. |
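One way to turn that guess into a check, once a tokeniser config is available, is to look for the marker strings in it. The sketch below assumes the config is a JSON document and makes no assumption about its schema: it simply walks every key and string value. The function names and the sample config are hypothetical, and finding the markers in the text tokeniser would still not prove that the audio stream has an equivalent.

```python
import json
from typing import Any, Iterator, Sequence


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every key and string value found anywhere in parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def has_think_markers(
    config_json: str,
    markers: Sequence[str] = ("<think>", "</think>"),
) -> bool:
    """Return True only if every marker appears verbatim in the config."""
    strings = set(iter_strings(json.loads(config_json)))
    return all(marker in strings for marker in markers)


# Hypothetical config fragment, invented for this example.
sample = json.dumps(
    {"special": [{"content": "<think>"}, {"content": "</think>"}]}
)
print(has_think_markers(sample))
```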
|
|
|