raw_anon_1111 3 hours ago

If it is doing a tool call, it has to convert the speech to text, or at least to a JSON object of the parameters the tool needs, and then convert the result back to speech, doesn't it? Is it truly speech-to-speech then?

satvikpendem 3 hours ago

It's all tokens at the end of the day, not really text or video or audio, just as everything on a machine is bits of 1s and 0s and it's up to the program to interpret them as a certain file format. These models are more speech-to-speech (+ text), in that they can handle text tokens too. So the flow is: you ask it something, and then it generates,

Audio Tokens: "Let me check that for you..." (Sent to the speaker)

Special Token: [CALL_TOOL: get_weather]

Text Tokens: {"location": "Seattle, WA"}

Special Token: [STOP]

The orchestrator around the model catches the CALL_TOOL token, calls the tool, and injects the tool's result into the context of the audio model, which then generates new tokens based on that.
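
To make that concrete, here is a rough Python sketch of what such an orchestrator loop might look like. Everything in it is an assumption for illustration: the `Token` class, the `kind` labels, the `get_weather` stub, and the shape of the context entry are all made up, and real speech models expose their own token formats and APIs.

```python
import json
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical token kinds: "audio", "text", "call_tool", "stop".
    kind: str
    value: str = ""


def get_weather(location: str) -> dict:
    # Stand-in tool; a real one would query a weather service.
    return {"location": location, "forecast": "rain"}


# Hypothetical registry mapping tool names to callables.
TOOLS = {"get_weather": get_weather}


def orchestrate(tokens, context, speaker):
    """Handle one model turn: play audio, run a tool call, extend the context."""
    pending_tool = None
    arg_text = []
    for tok in tokens:
        if tok.kind == "audio":
            speaker.append(tok.value)       # audio tokens go straight to the speaker
        elif tok.kind == "call_tool":
            pending_tool = tok.value        # special token names the tool
        elif tok.kind == "text" and pending_tool is not None:
            arg_text.append(tok.value)      # text tokens carry the JSON arguments
        elif tok.kind == "stop":
            break                           # model has finished this turn
    if pending_tool is not None:
        args = json.loads("".join(arg_text))
        result = TOOLS[pending_tool](**args)
        # Inject the tool result so the model can generate its next tokens from it.
        context.append(
            {"role": "tool", "name": pending_tool, "content": json.dumps(result)}
        )
    return context


# The turn from the comment above, expressed as a token list.
turn = [
    Token("audio", "Let me check that for you..."),
    Token("call_tool", "get_weather"),
    Token("text", '{"location": "Seattle, WA"}'),
    Token("stop"),
]
speaker, context = [], []
orchestrate(turn, context, speaker)
```

After this runs, `speaker` holds the one audio chunk and `context` holds a single tool-result entry. In a real system the model would be called again with that extended context and would answer with more audio tokens; the speech itself is never transcribed, only the tool arguments pass through as text.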