Well, behind "models", not "language models".
Of course, models made purely for image stuff will completely wipe it out. Vision language models are useful for their generalist capabilities.