| ▲ | WarmWash 2 days ago | |
Even multimodal models are still really bad when it comes to vision. The strength is still definitely language. | ||
| ▲ | nostrebored a day ago | |
Training for tasks still works pretty well, but “vision” is a super broad domain and most models seem optimized for OCR and screen processing (which have verifiable outputs and relatively straightforward data generation). | ||