jumploops | 17 hours ago
This new model is autoregression-based (token by token, like an LLM) rather than diffusion-based, which means it adheres to text prompts with much higher accuracy. As an example, some users of a generative image app (myself included) were trying to make a picture of a person in the pouch of a kangaroo. No matter what we prompted, we couldn't get it to work. GPT-4o did it in one shot!
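Roughly what "token by token" means here, as a toy sketch rather than anything OpenAI has confirmed (the model and detokenizer interfaces are made up):

    import torch

    def generate_image_autoregressively(model, detokenizer, prompt_tokens, n_image_tokens=1024):
        image_tokens = []
        for _ in range(n_image_tokens):
            # Condition on the prompt plus every image token emitted so far,
            # exactly like next-token prediction in a text LLM.
            logits = model(prompt_tokens, image_tokens)   # shape: [codebook_size]
            probs = torch.softmax(logits, dim=-1)
            image_tokens.append(torch.multinomial(probs, num_samples=1).item())
        # A VQ-style decoder turns the finished token grid back into pixels.
        return detokenizer(image_tokens)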
yousif_123123 | 17 hours ago
It feels like a mix of both to me as I've been testing it. For example, you can't get it to draw a clock showing a custom time like 3:30, or someone writing with their left hand. And it can't follow many instructions, or follow them very precisely. But it shows that this kind of architecture will most likely be capable of that if scaled up.
| ||||||||||||||
n2d4 | 16 hours ago
Source? It's much more likely that the LLM generates the latent vector which serves as an input to the diffusion model.
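To be concrete about what I mean, here's a rough sketch (all names hypothetical, not a claim about OpenAI's internals): the LLM's output would only be a conditioning latent, and a separate diffusion model would do the actual pixel generation by iterative denoising.

    import torch

    def generate_image_with_diffusion(llm, denoise_step, prompt_tokens, steps=50):
        cond = llm(prompt_tokens)            # latent conditioning vector from the LLM
        x = torch.randn(1, 3, 512, 512)      # start from pure noise
        for t in reversed(range(steps)):
            x = denoise_step(x, t, cond)     # iteratively denoise, guided by the latent
        return x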