raw_anon_1111 | 3 hours ago
If it is doing a tool call, it has to convert the speech to text, or at least to a JSON object of the necessary parameters for the tool, and then convert the result back to speech, doesn't it? Is it truly speech-to-speech then?
satvikpendem | 3 hours ago | parent
It's all tokens at the end of the day, not really text or video or audio, just as everything on a machine is bits of 1s and 0s and it's up to the program to interpret them as a certain file format. These models are speech-to-speech (+ text) in that they can emit and recognize text tokens too. So the flow is: you ask it something, then the model emits

Audio Tokens: "Let me check that for you..." (sent to the speaker)
Special Token: [CALL_TOOL: get_weather]
Text Tokens: {"location": "Seattle, WA"}
Special Token: [STOP]

The orchestrator around the model catches the CALL_TOOL, calls the tool, and injects the result into the context of the audio model, which then generates new tokens based on it.
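A minimal sketch of what that orchestrator loop could look like, assuming a tokenized stream of (kind, value) pairs; the [CALL_TOOL]/[STOP] markers, the get_weather tool, and the Speaker/ModelContext classes are illustrative stand-ins, not any vendor's actual API:

```python
import json

def get_weather(location):
    # Hypothetical tool; a real one would call a weather API.
    return {"location": location, "temp_f": 54}

TOOLS = {"get_weather": get_weather}

class Speaker:
    """Collects audio tokens that would be sent to the speaker."""
    def __init__(self):
        self.played = []
    def play(self, audio_token):
        self.played.append(audio_token)

class ModelContext:
    """Stand-in for injecting tool results back into the model's context."""
    def __init__(self):
        self.injected = []
    def append(self, item):
        self.injected.append(item)

def run_turn(token_stream, speaker, context):
    """Route audio tokens straight to the speaker; intercept tool calls."""
    pending_tool, arg_text = None, []
    for kind, value in token_stream:
        if kind == "audio":
            speaker.play(value)                      # audio goes straight out
        elif kind == "special" and value.startswith("CALL_TOOL:"):
            pending_tool = value.split(":", 1)[1].strip()
        elif kind == "text" and pending_tool is not None:
            arg_text.append(value)                   # JSON arguments for the tool
        elif kind == "special" and value == "STOP" and pending_tool:
            args = json.loads("".join(arg_text))
            result = TOOLS[pending_tool](**args)
            # Inject the result; the model then generates fresh audio tokens.
            context.append({"tool_result": result})
            pending_tool, arg_text = None, []

# The exact stream from the comment above:
stream = [
    ("audio", "<let-me-check>"),
    ("special", "CALL_TOOL: get_weather"),
    ("text", '{"location": "Seattle, WA"}'),
    ("special", "STOP"),
]
speaker, context = Speaker(), ModelContext()
run_turn(stream, speaker, context)
```

The key point the sketch makes concrete: the model never "converts speech to text" as a separate stage; it interleaves audio, text, and special tokens in one stream, and only the orchestrator treats the text span between CALL_TOOL and STOP as structured tool arguments.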