adastra22 10 hours ago
Isn’t that how all LLMs work? | ||||||||||||||||||||||||||||||||||||||||||||
simonw 10 hours ago | parent
The existing vision LLMs all work like this, and that covers most of the major models these days. Multi-modal audio models are a lot less common. GPT-4o was meant to do this natively from the start, but they ended up shipping separate custom models based on it for their audio features. As far as I can tell GPT-5 doesn't have audio input/output at all: the OpenAI features for that still use GPT-4o-audio. I don't know if Gemini 2.5 (which is multi-modal for vision and audio) shares the same embedding space for all three, but I expect it probably does.
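As a rough illustration of what "shared embedding space" means here, this is a minimal PyTorch-style sketch, not taken from any of these models; the dimensions, encoder widths, and the `build_input_sequence` helper are all assumptions. The idea is that each modality's encoder output gets projected into the text token embedding space, and the transformer then attends over one combined sequence of "tokens".

```python
# Hypothetical sketch of a multi-modal input pipeline: text, image, and audio
# features are mapped into one shared embedding space before the transformer
# sees them. All sizes and names below are illustrative assumptions.
import torch
import torch.nn as nn

D_MODEL = 4096          # transformer hidden size (assumed)
VOCAB_SIZE = 128_000    # text vocabulary size (assumed)

text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)

# Modality-specific encoders emit features in their own dimensions...
VISION_DIM = 1024       # e.g. output width of a ViT-style image encoder
AUDIO_DIM = 512         # e.g. output width of an audio encoder

# ...and small projection layers map those features into the text embedding space.
vision_proj = nn.Linear(VISION_DIM, D_MODEL)
audio_proj = nn.Linear(AUDIO_DIM, D_MODEL)

def build_input_sequence(text_token_ids, image_features, audio_features):
    """Concatenate embedded text tokens with projected image/audio features.

    text_token_ids: LongTensor of shape (num_text_tokens,)
    image_features: FloatTensor of shape (num_image_patches, VISION_DIM)
    audio_features: FloatTensor of shape (num_audio_frames, AUDIO_DIM)
    Returns a (total_tokens, D_MODEL) sequence the decoder attends over,
    treating every modality as just more tokens in the same space.
    """
    parts = [
        text_embed(text_token_ids),
        vision_proj(image_features),
        audio_proj(audio_features),
    ]
    return torch.cat(parts, dim=0)

# Dummy inputs just to show the shapes line up:
seq = build_input_sequence(
    torch.randint(0, VOCAB_SIZE, (12,)),
    torch.randn(64, VISION_DIM),
    torch.randn(32, AUDIO_DIM),
)
print(seq.shape)  # torch.Size([108, 4096])
```

Whether audio goes through a separate encoder like this or is tokenized directly varies by model; the sketch only shows the common "project everything into one token stream" pattern the comment is describing.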