firecall 3 hours ago
AFAIK the data does not need to be text.
teaearlgraycold 2 hours ago
Well, diffusers are trained unsupervised on raw pictures. I don't know how they train multi-modal LLMs on images, but yes, obviously they are consuming media other than just text. I don't think models glean much of their "knowledge" from non-textual training data, though I'd be happy to be corrected.