satvikpendem 3 hours ago
It's all tokens at the end of the day, not really text or video or audio, just like everything on a machine is bits of 1s and 0s and it's up to the program to interpret them as a particular file format. These models are more speech-to-speech (+ text), in that they can emit and recognize text tokens too. So the flow is: you ask it something, and the model emits

  Audio tokens:  "Let me check that for you..." (sent to the speaker)
  Special token: [CALL_TOOL: get_weather]
  Text tokens:   {"location": "Seattle, WA"}
  Special token: [STOP]

The orchestrator catches the CALL_TOOL, calls the tool, and injects the result into the context of the audio model, which then generates new tokens based on it.
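That orchestrator loop can be sketched roughly like this. Everything here is hypothetical (the token types, the `fake_model` stand-in, the `get_weather` tool); it's just to show the catch-and-inject shape, not any real API:

```python
# Sketch of the orchestrator loop: stream tokens from the model, catch
# CALL_TOOL, run the tool, inject the result back into the context, and
# let the model continue. All names here are made up for illustration.
import json

def fake_model(context):
    """Stand-in for the audio model: yields (token_type, payload) pairs."""
    if not any(t == "TOOL_RESULT" for t, _ in context):
        yield ("AUDIO", "Let me check that for you...")
        yield ("CALL_TOOL", "get_weather")
        yield ("TEXT", '{"location": "Seattle, WA"}')
        yield ("STOP", None)
    else:
        # Second pass: the tool result is in context, so answer with audio.
        yield ("AUDIO", "It's 54F and raining in Seattle.")
        yield ("STOP", None)

TOOLS = {"get_weather": lambda args: {"temp_f": 54, "conditions": "rain"}}

def run(context):
    """Orchestrator: loop until the model stops without requesting a tool."""
    spoken = []
    while True:
        tool_name, tool_args = None, []
        for token_type, payload in fake_model(context):
            context.append((token_type, payload))
            if token_type == "AUDIO":
                spoken.append(payload)        # would be sent to the speaker
            elif token_type == "CALL_TOOL":
                tool_name = payload
            elif token_type == "TEXT" and tool_name:
                tool_args.append(payload)     # JSON arguments for the tool
        if tool_name is None:
            return spoken                     # finished without a tool call
        result = TOOLS[tool_name](json.loads("".join(tool_args)))
        context.append(("TOOL_RESULT", json.dumps(result)))  # inject result
```

Calling run([]) plays out both turns: the filler audio, the tool call, then the follow-up audio generated with the tool result in context.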