kazinator 6 hours ago

Suppose we have two image-oriented AI's.

One is trained on a set of pairs matching words with images: vast numbers of images tagged with words.

The other is trained on pairs of photographs of exactly the same scene from the same vantage point, one taken in daylight and the other at night. Suppose all of these images are copyrighted and used without permission.

With the one AI, we can do word-to-image to generate an image. Clearly, that is a derived work of the training set of images; it's just interpolating among them based on the word associations.

With the other AI, we can take a photograph that we took ourselves in daylight and generate a night version of the same scene. This is not clearly infringing on the training set, even though the output depends on it. We used the set without permission to have the machine extract and learn the concept of diurnal vs. nocturnal appearance of scenes, based on which it is, in a sense, "reimagining" our daytime image as a nighttime one.
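
To make the two setups concrete, here is a rough sketch (toy PyTorch-style code; the model, loss function, and tensor shapes are placeholder assumptions, not how a production image model is actually built):

    import torch
    import torch.nn.functional as F

    def training_step(model, conditioning, target_image, optimizer):
        # Predict an image from a conditioning signal and nudge the weights
        # toward the target image from the training set.
        predicted = model(conditioning)
        loss = F.mse_loss(predicted, target_image)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Word-to-image AI: conditioning is a text embedding, target is the tagged image.
    # Day-to-night AI:  conditioning is the daytime photo, target is the night photo.
    model = torch.nn.Linear(64, 64)                     # toy stand-in for a real image model
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    conditioning, target = torch.randn(4, 64), torch.randn(4, 64)  # dummy batch
    training_step(model, conditioning, target, opt)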

The question of whether AI is stealing material depends on exactly what the training pathway is: what it is learning from the data. Is it learning just to crib and interpolate, or to glean some general concept that is not protected by copyright, like separating mixed audio into tracks, changing day to night, or whatever?

kouteiheika 5 hours ago | parent

> With the one AI, we can do word-to-image to generate an image. Clearly, that is a derived work of the training set of images

> The question of whether AI is stealing material depends exactly on what the training pathway is; what it is that it is learning from the data.

No, it isn't. The question of whether AI is stealing material has little to do with the training pathway and everything to do with scale.

To give a very simple example: is your model a trillion-parameter model, but you're training it on 1,000 images? It's going to memorize.

Is your model a 3-billion-parameter model, but you're training it on trillions of images? It's going to generalize, because it simply doesn't physically have the capacity to memorize its training data, and, assuming you've deduplicated your training dataset, it's not going to memorize any single image.
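
To put rough numbers on that (all figures illustrative, not measurements of any real model or dataset):

    # Compare raw parameter storage against raw training-data size.
    bytes_per_param = 2        # e.g. bf16 weights
    bytes_per_image = 100e3    # assume ~100 KB per compressed image

    # Trillion-parameter model, 1,000 images: the weights dwarf the data.
    print(1e12 * bytes_per_param, "bytes of weights vs", 1e3 * bytes_per_image, "bytes of data")
    # -> ~2 TB of weights vs ~100 MB of data: plenty of room to memorize.

    # 3-billion-parameter model, a trillion images: the data dwarfs the weights.
    print(3e9 * bytes_per_param, "bytes of weights vs", 1e12 * bytes_per_image, "bytes of data")
    # -> ~6 GB of weights vs ~100 PB of data: no room to memorize, so it has to generalize.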

It literally makes no difference whether you use the "trained on the same scene, but one in daylight and one at night" or the "generate the image based on a description" training objective here. Depending on how you pick your hyperparameters, you can trivially make either one memorize the training data (i.e., in your words, "make it clearly a derived work of the training set of images").
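
As a toy illustration of the hyperparameter point (a tiny linear "model" and random "images", nothing like a real setup): give the model more parameters than there are values in its training set, run enough optimization steps, and the exact same supervised objective drives the training loss to essentially zero, i.e. the model has effectively stored its training pairs.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    cond = torch.randn(8, 32)           # 8 conditioning inputs ("captions" or "daytime photos")
    target = torch.randn(8, 32)         # 8 target "images" (8 x 32 = 256 values total)
    model = torch.nn.Linear(32, 32)     # ~1,000 parameters, far more than 256 target values
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    for _ in range(5000):               # tiny dataset, many passes over it
        loss = F.mse_loss(model(cond), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(loss.item())                  # ~0: the model has memorized its training pairs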