ACCount37 | 5 days ago
LLM image frontends suck, and a lot of them suck big time. The naive approach of "use a pretrained encoder to massage the input pixels into a bag of soft tokens and paste those tokens into the context window" is good enough to get you a third of the way to humanlike vision performance - but struggles to go much further. Claude's current vision implementation is also notoriously awful. Like, "a goddamn 4B Gemma 3 beats it" level of awful. For a lot of vision-heavy tasks, you'd be better off using literally anything else.
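
To make the "bag of soft tokens" idea concrete, here is a minimal PyTorch sketch of that naive frontend - not Claude's, Gemma's, or any real model's implementation; the encoder, projector, and dimensions are all illustrative. A (notionally pretrained, frozen) vision encoder turns image patches into embeddings, a linear projector maps them into the LLM's token-embedding space, and the result is concatenated with the text token embeddings in the context window:

    import torch
    import torch.nn as nn

    class NaiveVisionFrontend(nn.Module):
        """Sketch of the naive 'soft tokens' vision frontend.

        All module choices and sizes are illustrative stand-ins,
        not taken from any actual production model.
        """

        def __init__(self, vision_dim=1024, llm_dim=4096):
            super().__init__()
            # Stand-in for a pretrained ViT-style encoder (frozen in practice).
            self.vision_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
                num_layers=2,
            )
            # Maps patch embeddings into the LLM's token embedding space.
            self.projector = nn.Linear(vision_dim, llm_dim)

        def forward(self, patch_embeddings, text_embeddings):
            # patch_embeddings: (batch, num_patches, vision_dim)
            # text_embeddings:  (batch, num_text_tokens, llm_dim)
            vision_features = self.vision_encoder(patch_embeddings)
            soft_tokens = self.projector(vision_features)
            # "Paste" the image tokens ahead of the text tokens in the context.
            return torch.cat([soft_tokens, text_embeddings], dim=1)

    if __name__ == "__main__":
        frontend = NaiveVisionFrontend()
        patches = torch.randn(1, 256, 1024)   # fake patch embeddings
        text = torch.randn(1, 32, 4096)       # fake text token embeddings
        print(frontend(patches, text).shape)  # torch.Size([1, 288, 4096])

Everything the LLM ever sees of the image is that fixed bag of projected tokens, which is the whole "massage the pixels and paste into the context" pipeline described above.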
bubblyworld | 5 days ago | parent
Wild, I found it hard to believe that a 4B model could beat Sonnet 3.5 at anything, but at least on the vision arena (https://lmarena.ai/leaderboard/vision) it seems like Sonnet 3.5 is at the same Elo as a 27B Gemma (~1150), so it's plausible. I guess that just says more about how bad vision LLMs are right now than anything else.
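
For context on what "same Elo" means here, a quick sketch of the standard Elo expected-score formula (the ~1150 figures are just the leaderboard numbers quoted above, not recomputed): equal ratings imply a roughly 50% head-to-head preference rate, and even a 50-point gap is only a modest edge.

    # Standard Elo expected-score formula: probability that A is preferred over B.
    def elo_expected(rating_a: float, rating_b: float) -> float:
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

    print(elo_expected(1150, 1150))  # 0.5   -> equal ratings, coin-flip preference
    print(elo_expected(1150, 1100))  # ~0.57 -> 50-point gap is only a modest edge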