The latest models are natively multimodal. Audio, video, images, and text are all tokenised and interpreted by the same model.
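
To make "tokenised in the same model" concrete, here is a toy sketch of the idea: each modality is turned into integer tokens, and all of them end up in one shared sequence. Everything in it is an assumption for illustration. The vocabulary layout, the offsets, and the function names (`tokenise_text`, `tokenise_image`, `tokenise_audio`, `build_sequence`) are hypothetical, and the byte, patch-mean, and amplitude quantisers stand in for the learned tokenisers a real model would likely use. Video is left out; it could be treated as a series of image frames.

```python
from typing import List

# Hypothetical shared vocabulary: each modality gets its own ID range,
# so one model can consume a single mixed token sequence.
TEXT_OFFSET = 0      # IDs 0..255   : one token per UTF-8 byte
IMAGE_OFFSET = 256   # IDs 256..511 : one token per quantised image patch
AUDIO_OFFSET = 512   # IDs 512..767 : one token per quantised audio sample


def tokenise_text(text: str) -> List[int]:
    """Toy text tokeniser: one token per UTF-8 byte."""
    return [TEXT_OFFSET + b for b in text.encode("utf-8")]


def tokenise_image(pixels: List[List[int]], patch: int = 2) -> List[int]:
    """Toy image tokeniser: mean intensity (0..255) of each patch x patch block."""
    tokens = []
    height, width = len(pixels), len(pixels[0])
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            block = [
                pixels[i][j]
                for i in range(r, min(r + patch, height))
                for j in range(c, min(c + patch, width))
            ]
            tokens.append(IMAGE_OFFSET + sum(block) // len(block))
    return tokens


def tokenise_audio(samples: List[float]) -> List[int]:
    """Toy audio tokeniser: samples in [-1, 1] quantised to 256 levels."""
    tokens = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        level = min(255, int((clipped + 1.0) / 2.0 * 256))
        tokens.append(AUDIO_OFFSET + level)
    return tokens


def build_sequence(text: str, pixels: List[List[int]], samples: List[float]) -> List[int]:
    """Concatenate all modalities into the one sequence a single model would read."""
    return tokenise_text(text) + tokenise_image(pixels) + tokenise_audio(samples)


if __name__ == "__main__":
    sequence = build_sequence("hi", [[0, 0], [0, 0]], [0.0])
    print(sequence)
```

The point of the separate ID ranges is only that the model sees one stream of tokens regardless of where each token came from; how real systems carve up or embed that stream varies and is not specified here.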