raw_anon_1111 | 3 hours ago
If it is doing a tool call, it has to convert the speech to text, or at least to a JSON object of the necessary parameters for the tool, and then convert the result back to speech, doesn't it? Is it truly speech-to-speech then?
satvikpendem | 3 hours ago | parent
It's all tokens at the end of the day, not really text or video or audio, just as everything on a machine is bits of 1s and 0s and it's up to the program to interpret them as a certain file format. These models are speech-to-speech (+ text) in that they can emit and recognize text tokens too. So the flow is: you ask it something, then the model emits

Audio Tokens: "Let me check that for you..." (sent to the speaker)
Special Token: [CALL_TOOL: get_weather]
Text Tokens: {"location": "Seattle, WA"}
Special Token: [STOP]

The orchestrator around the model catches the CALL_TOOL, calls the tool, and injects the result into the context of the audio model, which then generates new tokens based on it.
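A minimal sketch of what that orchestrator loop could look like, assuming a tokenized stream of (kind, value) pairs; the [CALL_TOOL]/[STOP] markers, the get_weather tool, and the Speaker/ModelContext classes are illustrative stand-ins, not any vendor's actual API:

```python
import json

def get_weather(location):
    # Hypothetical tool; a real one would call a weather API.
    return {"location": location, "temp_f": 54}

TOOLS = {"get_weather": get_weather}

class Speaker:
    """Collects audio tokens that would be sent to the speaker."""
    def __init__(self):
        self.played = []
    def play(self, audio_token):
        self.played.append(audio_token)

class ModelContext:
    """Stand-in for injecting tool results back into the model's context."""
    def __init__(self):
        self.injected = []
    def append(self, item):
        self.injected.append(item)

def run_turn(token_stream, speaker, context):
    """Route audio tokens straight to the speaker; intercept tool calls."""
    pending_tool, arg_text = None, []
    for kind, value in token_stream:
        if kind == "audio":
            speaker.play(value)                      # audio goes straight out
        elif kind == "special" and value.startswith("CALL_TOOL:"):
            pending_tool = value.split(":", 1)[1].strip()
        elif kind == "text" and pending_tool is not None:
            arg_text.append(value)                   # JSON arguments for the tool
        elif kind == "special" and value == "STOP" and pending_tool:
            args = json.loads("".join(arg_text))
            result = TOOLS[pending_tool](**args)
            # Inject the result; the model then generates fresh audio tokens.
            context.append({"tool_result": result})
            pending_tool, arg_text = None, []

# The exact stream from the comment above:
stream = [
    ("audio", "<let-me-check>"),
    ("special", "CALL_TOOL: get_weather"),
    ("text", '{"location": "Seattle, WA"}'),
    ("special", "STOP"),
]
speaker, context = Speaker(), ModelContext()
run_turn(stream, speaker, context)
```

The key point the sketch makes concrete: the model never "converts speech to text" as a separate stage; it interleaves audio, text, and special tokens in one stream, and only the orchestrator treats the text span between CALL_TOOL and STOP as structured tool arguments.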