XenophileJKO 2 hours ago

It mostly depends on how the models work. Unified multimodal text/image sequence-to-sequence models can do this pretty well; diffusion models don't.