littlecranky67 a day ago

> Claude/LLMs in general are still pretty bad at the intricate details of layouts and visual things

Because the rendered output (pixels, not HTML/CSS) is not fed into the training data. You will find tons of UI snippets and questions, but they rarely include screenshots. And when they do, the screenshots are not scraped.

Wowfunhappy a day ago

Interesting thought. I wonder if Anthropic et al could include some sort of render-html-to-screenshot as part of the training routine, such that the rendered output would get included as training data.
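A minimal sketch of what that could look like, assuming a headless browser driven by Playwright (the library choice and helper name are my own illustration, not anything Anthropic has described):

    from playwright.sync_api import sync_playwright

    def render_snippet(html: str, out_path: str) -> None:
        # Render an HTML/CSS snippet to pixels so the screenshot could be
        # paired with its source markup as a training example.
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page(viewport={"width": 800, "height": 600})
            page.set_content(html, wait_until="networkidle")
            page.screenshot(path=out_path, full_page=True)
            browser.close()

    render_snippet("<div style='display:flex;gap:8px'><button>OK</button></div>", "sample.png")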

btown a day ago

Even better, a tool that can report the rendered bounding box of any set of elements, and the distances between pairs of elements, so the model can make adjustments if the relative positioning doesn't match its expectation. This would be incredible for SVG generation for diagrams, too. (A rough sketch follows.)
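Something along these lines is already easy to prototype with Playwright's bounding_box(); the selectors and the single gap metric here are placeholders for whatever measurements such a tool would expose:

    from playwright.sync_api import sync_playwright

    def measure(html: str, selector_a: str, selector_b: str) -> dict:
        # Report the rendered bounding boxes of two elements and the gap
        # between them, so a model (or a test) can check its layout assumptions.
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.set_content(html)
            box_a = page.locator(selector_a).bounding_box()
            box_b = page.locator(selector_b).bounding_box()
            browser.close()
        # Horizontal gap between the right edge of A and the left edge of B.
        gap_x = box_b["x"] - (box_a["x"] + box_a["width"])
        return {"a": box_a, "b": box_b, "gap_x": gap_x}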

KaiserPro a day ago

That's basically a VLM, but the problem is that describing the world requires a better understanding of the world. That's why LeCun is talking about world models (it's also cutting-edge work for teaching robots to manipulate objects and plan manipulations).

ubercow13 a day ago

Why wouldn't they be?

littlecranky67 18 hours ago

Why would they be?

ubercow13 5 hours ago

Well, I don't know, but many LLMs are multimodal and understand images. You can upload videos to Gemini and they're tokenised and fed into the LLM. If a programming blog post has a screenshot showing the result of some UI code, why would that not be scraped and used for training? Is there some reason that wouldn't be possible?