Pretty much mirrors my experience using GPT to generate images creatively. I tried to generate an image to accompany a Robert Frost poem and it made something... plausibly related, but not what I was describing. I spent the next 90% of the time making it 10% closer to what I wanted, but it never got all the way there.

I’ve given it different levels of open-endedness, from “give this flow chart an aesthetic like this mechanical keyboard” to “generate an SVG of this graphic from a 70s slide show,” but it never looks quite like what I have in mind.

In the end, I think you only use this stuff to generate images if you’re prepared to accept whatever comes out on approximately the first try.

TheOtherHobbes 8 hours ago | parent [-]

This isn't a solvable problem without world models. Tokenised prompting is like stabbing a pin at a huge target in the dark. Sometimes something interesting falls out, but latent space doesn't have the definition to give most people exactly what they want.

When it does, it's more likely to be something popular and unoriginal, where the data is dense, and less likely to be something inventive and strange.

xienze 8 hours ago | parent [-]

> This isn't a solvable problem without world models.

I wish we could use something like a simple DSL rather than English prose to work with these models, in order to have some real precision to describe what we want.
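
To make that concrete, here is a rough sketch of what such a DSL could look like. Everything in it is made up for illustration: `ImageSpec`, `Subject`, the field names, and the line-oriented output format are hypothetical, and no existing image model accepts this as input. It only shows the idea of describing a picture as structured fields instead of prose.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Subject:
    """One thing in the picture (hypothetical schema)."""
    name: str
    position: str = "center"          # e.g. "left", "center", "right"
    attributes: Tuple[str, ...] = ()  # free-form modifiers


@dataclass(frozen=True)
class ImageSpec:
    """A whole-image description (hypothetical schema)."""
    style: str
    palette: Tuple[str, ...]
    subjects: Tuple[Subject, ...]

    def render(self) -> str:
        """Serialize the spec to a deterministic, line-oriented form."""
        lines = [
            f"style: {self.style}",
            f"palette: {', '.join(self.palette)}",
        ]
        for s in self.subjects:
            attrs = f" [{', '.join(s.attributes)}]" if s.attributes else ""
            lines.append(f"subject: {s.name} @ {s.position}{attrs}")
        return "\n".join(lines)


# Illustrative values only.
spec = ImageSpec(
    style="woodcut",
    palette=("ochre", "slate"),
    subjects=(
        Subject("flow chart", "center", ("hand-drawn",)),
        Subject("legend", "right"),
    ),
)
print(spec.render())
```

The point is that the same spec always serializes to the same text, so changing one field changes exactly one line of the request. Whether a model would actually honor that precision is a separate question.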

asnyder 3 hours ago | parent [-]

Nothing stops that from happening. The model just needs to be trained on that DSL. Though at that point it returns to its original form as a better autocomplete/IntelliSense :). That will likely happen in the specialized fields. We can already see tools like Figma, Mira, and others that generate functional-ish frontend components in full TypeScript with corresponding styles (that are also selectable and configurable in the interface). They're not quite as free-form, since they load their own base framework and components to ensure consistency and sanity/error checking, but even then they are in fact generating usable, modifiable components that you can work with precisely in your normal DSL.

For video, this likely exists or is being worked on as we speak. All specialized domain tools will move toward this model, so that domain experts can use the tools with the precision they expect AND get the agentic gains we already take for granted.