IanCal 6 days ago
It's hard to overstate this. They perform segmentation and masking and pass information from that to the model, and it helps enormously. Image understanding still lags drastically behind text performance, and the models make glaring mistakes that are hard to understand, but the Gemini 2.5 models are far and away the best of what I've tried.