djoldman 5 days ago
The data set is a list of ("descriptive text", URL) tuples. As with almost any URL, each entry points to an image rather than being one. As an aside, this presents a problem for researchers, because the links can resolve to different resources, or to no resource at all, depending on when they are accessed. It is therefore not a static dataset on which a machine learning model can be trained in a guaranteed reproducible fashion.
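To make the reproducibility point concrete, here is a minimal sketch of one way to detect that kind of drift: record a content hash per URL at download time and compare it on a later pass. Everything named here (`snapshot`, `drifted`, the `fetch` callable, the record layout) is hypothetical and for illustration only; it is not how this data set is actually distributed or checked.

```python
import hashlib
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical record layout: (caption, url, sha256 hex digest, or None if the fetch failed)
Record = Tuple[str, str, Optional[str]]
# Hypothetical fetcher: returns the bytes behind a URL, or None if nothing resolves
Fetch = Callable[[str], Optional[bytes]]


def _digest(data: Optional[bytes]) -> Optional[str]:
    """Hash the fetched bytes; a missing resource maps to None."""
    return hashlib.sha256(data).hexdigest() if data is not None else None


def snapshot(pairs: Iterable[Tuple[str, str]], fetch: Fetch) -> List[Record]:
    """Resolve every URL once and pin what it returned at that moment."""
    return [(caption, url, _digest(fetch(url))) for caption, url in pairs]


def drifted(records: Iterable[Record], fetch: Fetch) -> List[str]:
    """Return the URLs whose content now differs from the pinned digest."""
    return [url for _caption, url, pinned in records if _digest(fetch(url)) != pinned]
```

Any URL reported by `drifted` now serves different bytes (or nothing) compared with the snapshot, so two training runs built from the same list at different times would not see the same data. Passing the fetcher in as a parameter is just a convenience so the check can be exercised without network access.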
pera 5 days ago
I think you may be missing the point: the title says "AI training data set", which is the result of downloading the linked images. The list of tuples is just how this training dataset is distributed. The issue in question is that many, if not most, large generative AI models were trained on personal data.