Very bad take. With most modern multimodal models, you get way better performance than you do by going to text first.