comex 6 hours ago

LLMs are really bad at anything visual, as demonstrated by pelicans riding bicycles, or Claude Plays Pokémon.

Opus would probably do better though.

tartoran 6 hours ago | parent [-]

How could they be any good at visuals? They are trained on text after all.

comex 6 hours ago | parent | next [-]

Supposedly the frontier LLMs are multimodal and trained on images as well, though I don't know how much that helps for tasks that don't use the native image input/output support.

Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles:

https://simonwillison.net/tags/pelican-riding-a-bicycle/

But they're still not very good.

tartoran 6 hours ago | parent [-]

I have to admit I'm seeing this for the first time, and I'm somewhat impressed by the results. I even think they will get better with more training, why not... But are these multimodal LLMs still LLMs? I mean, they're still LLMs, but with a sidecar that does other things, and the image training takes place outside the LLM. So in a way the LLMs still don't "know" anything about these images; they're just generating them on the fly upon request.

boxedemp 4 hours ago | parent [-]

Maybe we should drop one of the L's

astrange 6 hours ago | parent | prev | next [-]

Claude is multimodal and can see images, though it's not good at thinking in them.

msephton 6 hours ago | parent | prev | next [-]

Shapes can be described as text or mathematical formulas.

tempest_ 6 hours ago | parent | prev [-]

An SVG is just text.
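
To make that concrete, here is a rough sketch of a shape going from a formula to SVG text. It is only an illustration, not how any model actually produces its drawings. The helper names `circle_points` and `svg_polygon` are made up for this example, and the 12-point polygon is an arbitrary approximation of a circle.

```python
import math
import xml.etree.ElementTree as ET


def circle_points(cx, cy, r, n=12):
    """Sample n points from the parametric circle (cx + r cos t, cy + r sin t)."""
    return [
        (cx + r * math.cos(2 * math.pi * k / n),
         cy + r * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


def svg_polygon(points, size=100):
    """Return an SVG document, as a plain string, that outlines the points."""
    coords = " ".join(f"{x:.1f},{y:.1f}" for x, y in points)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<polygon points="{coords}" fill="none" stroke="black"/>'
        "</svg>"
    )


# The whole "image" is an ordinary string built from numbers.
svg_text = svg_polygon(circle_points(50, 50, 40))

# Parsing it back confirms the string is well-formed XML.
root = ET.fromstring(svg_text)
```

Nothing here touches pixels: the drawing is a string of coordinates, which is why a text-only model can at least attempt one.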