What do you mean LLMs are blind? All frontier models are multimodal, which means they literally consume images as tokens. They can “see” exactly as well as they can “read”.

Also, GPT-Image-2 is not a diffusion model; it is based on Transformers, like other LLMs.

embedding-shape 6 hours ago

I guess they do "see" but more like "see an explanation of the image", not "see" as in experience visually. They're really bad at details and perfection when it comes to images, and don't understand things like visual hierarchy, affordances and other fundamental design concepts. Most of them are able to describe those things in words, but don't seem to fundamentally grasp them when asked to build UIs, even when these things are mentioned explicitly.

Try doing 100% vibe-coding with an agent, loosely specifying what kind of application you want, and observe how the resulting UI and UX are a complete mess unless you specify exactly how they should work in practice.

If they actually had spatial understanding, together with being able to visually experience images, then they'd probably be able to build proper UI/UX from the get-go, but since they can only describe what those things are, you end up with the messes even the current SOTAs produce.

stingraycharles 28 minutes ago

> I guess they do "see" but more like "see an explanation of the image", not "see" as in experience visually.

Images are tokenized and fed to the exact same model. They can “visually inspect” images, e.g. “find the 2 differences between two images” and “where’s Waldo”-style things.

So your mental model that they see descriptions is inaccurate.
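As a rough illustration of what "images are tokenized" can mean, here is a minimal sketch of ViT-style patch tokenization: the image is cut into fixed-size tiles and each tile becomes one vector. This is the general technique only. Whether any particular frontier model tokenizes images this way is an assumption, and `patchify` plus the tiny 4x4 image are hypothetical names and data made up for the example. In a real model each vector would then go through a learned projection before entering the Transformer.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel intensities (0-255)


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles (row-major)
    and flatten each tile into a vector scaled to [0, 1]."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # One tile becomes one "token": its raw pixels, not a text description.
            tile = [image[top + dy][left + dx] / 255.0
                    for dy in range(patch) for dx in range(patch)]
            tokens.append(tile)
    return tokens


# Hypothetical 4x4 image cut into 2x2 patches: 4 tokens of 4 values each.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
tokens = patchify(image, patch=2)
```

The point of the sketch is that the model's input is derived from pixel values directly, which is consistent with the claim that no intermediate caption is involved.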

spongebobstoes 5 hours ago

the models can accept images directly as tokens. not a description of an image, the actual image itself.

yes, the visual intelligence is limited, but they do actually have vision capabilities.

marcus_holmes 5 hours ago

This is my experience too, but with all other aspects of the application. If you only loosely describe it, it comes out as a mess. You have to know what you're building to get the LLM to actually build something decent. I don't think this is purely a visual or design constraint.

embedding-shape 5 hours ago

When I'm using agents for programming, I can have an AGENTS.md outlining exactly what requirements, guidelines and constraints all the code needs to follow, and the agent (codex in my case) will pretty much nail that.

I've tried doing the same for design work, really outlining exactly how the UI and UX need to look and work, but for some reason it struggles a whole bunch with it, regardless of how clear I am. Maybe I'm just worse at explaining and describing the UI and UX I'm actually after, though, I suppose.

marcus_holmes 3 hours ago

I once worked at a startup where the CEO was originally a designer. He once spent two days huddled with the main designer for the product, trying to pick exactly the right font for the product. I have no idea how you'd have that kind of discussion with an LLM. But then, I would not spend more than five minutes on this decision, so I'm probably the wrong audience for this ;)

slashdave 6 hours ago

Tokens are not a substitute for a numerical measurement.

Ask an LLM how much time has passed. Watch it hallucinate wildly.

Has anyone noticed that Opus has trouble building ASCII diagrams (it often leaves out spaces, so lines are misaligned)?

arjie 2 hours ago

LLMs are just one mechanical component. One might as well say "Ask your println how much time has passed"; that is not a question that makes sense. As an example: I did not construct my agent specifically to answer your question, but when I saw it I queried the agent, and its answer was correct. https://imgur.com/a/j8j7hL9

As semiquaver said, modern LLMs are multimodal: they can reason in image-space and audio-space as well as in text-space. It is not a translate-then-operate kind of situation. Claude Design is not a raw LLM, nor an instruction-tuned LLM. It is an agent harness around an LLM that allows it to do certain things.

semiquaver 6 hours ago

Ok? Your comment is in no way responsive to anything I said.
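To make the "agent harness" point about elapsed time concrete, here is a minimal sketch of a harness that reads the clock itself and hands the numbers to the model as context, so the model never has to guess. Everything here is an assumption for illustration: `build_prompt`, the `[harness]` line format and the timestamps are hypothetical, and this is not how any specific product is known to work.

```python
from datetime import datetime, timezone


def elapsed_seconds(start: datetime, now: datetime) -> float:
    """Wall-clock seconds between two timezone-aware timestamps."""
    return (now - start).total_seconds()


def build_prompt(question: str, session_start: datetime, now: datetime) -> str:
    """Prepend clock facts measured by the harness to the user's question.
    The model only has to read the numbers, not estimate them."""
    elapsed = elapsed_seconds(session_start, now)
    return (
        f"[harness] current UTC time: {now.isoformat()}\n"
        f"[harness] seconds since session start: {elapsed:.0f}\n"
        f"[user] {question}"
    )


# Hypothetical session: started at 12:00:00 UTC, question asked at 12:05:30 UTC.
session_start = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
now = datetime(2025, 1, 1, 12, 5, 30, tzinfo=timezone.utc)
prompt = build_prompt("How much time has passed?", session_start, now)
```

In a live harness `now` would come from `datetime.now(timezone.utc)` at request time. The measurement is done by ordinary code, and the LLM is only one component of the system that answers.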

bombcar 6 hours ago

Claude has been kicking ass at code, but I asked it to “sketch” a second floor with a stairway and bedrooms with large closets and it made … something that resembles something akin to not at all what I asked.