Considering that most SOTA LLMs are also multimodal/vision models, could they get better results if the LLM were also given visual feedback on its own output?
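
For illustration, a minimal sketch of what such a loop might look like, assuming an OpenAI-style chat API with a vision-capable model ("gpt-4o" here) and a hypothetical `render_to_png` helper that rasterizes the model's current output into a screenshot; this is just one way the idea could be wired up, not a reference to any existing implementation.

```python
# Hedged sketch of a visual-feedback loop for a multimodal LLM.
# Assumes the OpenAI Python SDK and a vision-capable model ("gpt-4o").
# render_to_png() is a hypothetical helper, e.g. a headless browser that
# rasterizes the model's output (HTML/SVG/plot code) into a PNG.
import base64
from openai import OpenAI

client = OpenAI()

def render_to_png(artifact: str) -> bytes:
    """Hypothetical renderer: turn the model's output into a screenshot."""
    raise NotImplementedError("plug in a real renderer here")

def revise_with_visual_feedback(artifact: str, goal: str) -> str:
    """Show the model a screenshot of its own output and ask for a revision."""
    screenshot_b64 = base64.b64encode(render_to_png(artifact)).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Goal: {goal}\n"
                          "Here is a screenshot of the current result. "
                          "Point out what looks wrong, then return a revised "
                          f"version of:\n{artifact}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The interesting question is whether seeing the rendered result (rather than only the source text) lets the model catch layout, color, or spatial mistakes it cannot infer from the code alone.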