All the latest general purpose models are multimodal (except DeepSeek I think). Transfer learning allows to improve results even after they exhausted all the text in the internet.