Remix.run Logo
WarmWash 2 days ago

Even multimodal models are still really bad when it comes to vision. The strength is still definitely language.

nostrebored a day ago | parent [-]

Training for tasks still works petty well, but “vision” is a super broad domain and most seem optimized for OCR and screen processing (which have verifiable outputs and relatively straightforward data generation)