adrian_b 9 hours ago
That's right, but there are other recent, relatively large open-weights LLMs that are multimodal, e.g. MiniMax-M3. With open-weights LLMs, it is affordable to use many different models, each for whatever it does best. Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants. You can feed their output to a big LLM as input in order to accomplish a more complex task.