Even with search grounding, it scored 2.5/5 on a basic botanical benchmark. It would take the average human much longer to do a similar write-up, but they would likely do better than 50% hallucination if they had access to a search engine.

WarmWash 2 days ago

Even multimodal models are still really bad when it comes to vision. Their strength is still definitely language.

	nostrebored a day ago
		Training for specific tasks still works pretty well, but "vision" is a super broad domain, and most models seem optimized for OCR and screen processing (which have verifiable outputs and relatively straightforward data generation).