samatman an hour ago
It's more complicated than that. Quite a bit more.

Commercial use counts _against_ a fair use defense, but is not dispositive: it's not accurate at all to say it "generally does not cover" commercial use. This is the "purpose and character" test, one of four in contemporary (United States) fair use doctrine. Purpose and character also includes the degree to which a use is _transformative_. It's clear that the degree to which a training run mulching texts "transforms" them is very high. This counts toward a fair use finding for purpose and character.

> is dependent on the amount of the original content present in the derived work, which I would contend in this case is “all of it”

The "amount and substantiality" test. Your case for "all of it" can't possibly be sustained: the models aren't big enough. It's amount _and_ substantiality: this has come up in the publication of concordances, where a relatively large amount of a copyrighted work appears, but it's chopped up and ordered in a way which is no longer substantially the same. Courts have ruled that this kind of text is fair use, pretty consistently. It's not an LLM, of course, but those have yet to be ruled on.

Also worth knowing that courts have never accepted reading or studying a work as incorporation, and are unlikely to change course on the question. It's taken for granted that anyone is allowed to read a copyrighted work in as much detail as they wish, in the course of producing another one. Model training isn't reading either, but the question is to what degree it resembles study. I'd say, more than not. Specifically:

> it’s impossible to make a useful model without the whole book and all of the artistry that went into it

Courts have never once accepted "it would be impossible for defendant to write his biography without reading plaintiff's" as valid, and it's been tried. The standard for plagiarism is higher than that.

"Effect upon the work's value" is probably the most interesting one. For some things, extreme, for others, negligible. I suspect this is the one courts are going to spend the most time on as all of these questions are litigated.

Ultimately, model training is highly out-of-distribution for the common law questions involving fair use. It was not anticipated by statute, to put it mildly. The best solution to that kind of dilemma is more statute, and we'll probably see that, but, I don't think you'll be happy with the result, given what I'm replying to. Just a guess on my part.
mplanchard an hour ago
It is of course true that it is unsettled law, and that fair use is more complicated than my offhand comment suggested.

> Courts have never once accepted "it would be impossible for defendant to write his biography without reading plaintiff's" as valid, and it's been tried. The standard for plagiarism is higher than that.

This I think misses the thrust of my argument, though. It's hard to find an exact human analogy, because neither the technology nor the scale at which it operates is remotely human. I see it less as “writing his biography without reading the plaintiff’s” and more as “using the same style and metaphors to make thousands of copies of very similar biographies, with certain bits tweaked,” like turning an existing work into a Mad Lib.

I don’t know how the courts will eventually rule on it, but it certainly feels like theft to me.