simonw 6 hours ago
I was surprised that GLM 5.1/5.2 are not vision models - they are text input only. That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x. In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS. Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
x3cca 14 minutes ago
I've been using Google AI Studio as a free vision bridge. Gemma 31B is dummy capable at vision and at 1500 rpd it's basically unlimited.
0xbadcafebee 5 hours ago
Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything".
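A rough sketch of that subagent idea, in case it helps to see the shape of it. Everything here is hypothetical: `VisionBridge`, `vision_model`, and `text_model` are illustrative names, not part of any real harness or SDK. The two models are passed in as plain callables, so you would wrap whatever client you actually use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures, not a real library API:
#   VisionFn takes (image bytes, instruction) and returns a text description.
#   TextFn takes a prompt and returns a completion.
VisionFn = Callable[[bytes, str], str]
TextFn = Callable[[str], str]

DESCRIBE_INSTRUCTION = (
    "Describe the layout, visible text, colors, and spacing "
    "in this screenshot in detail."
)


@dataclass
class VisionBridge:
    """Routes image tasks through a vision model, then hands text to the main model."""

    vision_model: VisionFn
    text_model: TextFn

    def run(self, image: bytes, task: str) -> str:
        # Step 1: the "sub-session": ask the vision model for a text description.
        description = self.vision_model(image, DESCRIBE_INSTRUCTION)

        # Step 2: feed the description back to the text-only main model.
        prompt = (
            "Screenshot description (from a separate vision model):\n"
            f"{description}\n\n"
            f"Task: {task}"
        )
        return self.text_model(prompt)
```

Usage would look something like `VisionBridge(my_vision_fn, my_glm_fn).run(screenshot_bytes, "Recreate this page as HTML+CSS")`, where both functions are your own wrappers. The obvious limit is that the main model only sees what the description captures, so fine visual detail can get lost along the way.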
_pdp_ 6 hours ago
I don't see this being such a big gap. There are some use cases for sure, but apart from UX/UI work it is not really needed. Besides, none of the frontier models can replicate actual images; they can only approximate them, at least in my own experience.
ashenke 5 hours ago
I had the same reaction with DeepSeek V4! It would be more useful as a vision model.