Gemini 2.5 has the best vision understanding of any model I've worked with. Leagues beyond GPT-5/o4.

It's hard to overstate this. They perform segmentation and masking, feed the resulting information to the model, and it helps enormously.

Image understanding is still drastically worse than text performance, with glaring mistakes that are hard to explain, but the Gemini 2.5 models are far and away the best of what I've tried.

pineaux 6 days ago

Yeah, I made a small app to sell my father's books. I scanned all the books by taking pictures of the bookshelves and books (a collection of almost 15k books, nearly all non-fiction), then fed them to different AIs. Combining Mistral OCR and Gemini worked very well. I ran everything past both AIs and compared the output per book, then saved all the output into an SQL database for later reference. I did some other stuff with it, then made a document out of the output and sent it to a large group of book buyers. I asked them to bid on individual books and on the whole collection.
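
A rough sketch of the compare-and-store step described above, not the actual app. It assumes the two outputs per book have already been extracted as plain strings; the table layout, the function names, and the 0.9 similarity threshold are all hypothetical choices made for illustration, and SQLite stands in for whatever SQL database was really used.

```python
import sqlite3
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def save_results(conn, results, threshold=0.9):
    """Store both outputs per book and flag pairs that disagree.

    `results` is an iterable of (book_id, ocr_text, vision_text) tuples.
    The schema and the default threshold are assumptions, not the
    original app's design.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS books (
               book_id TEXT PRIMARY KEY,
               ocr_text TEXT,
               vision_text TEXT,
               similarity REAL,
               needs_review INTEGER
           )"""
    )
    for book_id, ocr_text, vision_text in results:
        score = similarity(ocr_text, vision_text)
        conn.execute(
            "INSERT OR REPLACE INTO books VALUES (?, ?, ?, ?, ?)",
            (book_id, ocr_text, vision_text, score, int(score < threshold)),
        )
    conn.commit()


def books_needing_review(conn):
    """Return the ids of books where the two outputs differ too much."""
    rows = conn.execute(
        "SELECT book_id FROM books WHERE needs_review = 1 ORDER BY book_id"
    )
    return [row[0] for row in rows]
```

Books where the two models agree can go straight into the catalogue, while the flagged ones are the short list worth checking by hand.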

devinprater 6 days ago

There's a whole tool based on having Gemini 2.5 describe YouTube videos: OmniDescriber.

https://audioses.com/en/yazilimlar.php

johnfn 6 days ago

Interesting -- what sort of things do you use it for?

	devinprater 6 days ago
		Having YouTube videos described to me, basically. Since Google won't do it.