onion2k 3 days ago

> Someone tried to generate a retro hip-hop album cover image with AI, but the text is all nonsense, and humans would have to be hired to clean that AI slop

In about two years we've gone from "AI just generates rubbish where the text should be" to "AI spells things pretty wrong." This is largely down to generating a whole image with a textual element. Using a model like SDXL with a LoRA like Fooocus to do inpainting and an input image with a very rough approximation of the right text (added via MS Paint), you can get a pretty much perfect result. Give it another couple of years and the text generation will be spot on.

So yes, right now we need a human to either use the AI well, or to fix it afterwards. That's how technology always goes - something is invented, it's not perfect, humans need to fix the outputs, but eventually the human input diminishes to nothing.

zdragnar 3 days ago | parent | next [-]

> That's how technology always goes

This is not how AI has ever gone. Every approach so far has either been a total dead end, or the underlying concept got pivoted into a simplified, not-AI tech.

This new approach of machine learning content generation will either keep developing, or it will join everything else in the history of AI by hitting a point of diminishing to zero returns.

selalipop 3 days ago | parent | next [-]

But their comment is about 2 years out of date, and AI image gen has got exponentially better at text than when the models and LoRAs they mentioned were SOTA.

I agree we probably won't magically scale current techniques to AGI, but I also think the local maximum for creative output is going to be high enough that it changes how we approach it, the way computers changed how we approach knowledge work.

That's why I focus on it at least.

onion2k 3 days ago | parent | prev [-]

> This is not how AI has ever gone. Every approach so far has either been a total dead end, or the underlying concept got pivoted into a simplified, not-AI tech.

You're talking about the progress of technology. I'm talking about how humans use technology in its early states. They're not mutually exclusive.

vunderba 3 days ago | parent | prev [-]

Minor correction. Fooocus [1] isn't a LoRA - it's a Gradio-based frontend (in the same vein as Automatic1111, Forge, etc.) for image generation.

And most SOTA models (Imagen, Qwen 20b, etc.) at this point can actually already handle a fair amount of text in a single T2I generation. Flux Dev, provided you're willing to roll a couple of gens, can do it as well.

[1] https://github.com/lllyasviel/Fooocus