jsheard 7 hours ago

It would make sense from a legal standpoint, but I don't think they could do that without massively regressing their models' performance, to the point that it would jeopardize their viability as a company.

Xelynega 7 hours ago | parent | next [-]

I agree, just want to make sure "they can't stop doing illegal things or they wouldn't be a success" is said out loud instead of left to subtext.

CuriouslyC 6 hours ago | parent [-]

They can't stop doing things some people don't like (people who also won't stop doing things other people don't like). The legality of the claims is questionable, which is why most are getting thrown out, but we'll see if this narrow approach works out.

I'm sure there are also a number of easy technical ways to "include" the metadata while mostly ignoring it during training that would skirt the letter of the law if needed.
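
(To sketch what I mean, purely hypothetically: in a standard causal-LM training loop you can keep metadata tokens in the input while excluding them from the loss via the usual label-masking convention. The token IDs and span positions below are made up; -100 is just the default ignore_index of PyTorch's cross-entropy loss.)

    # Hypothetical sketch, not anyone's actual pipeline: keep attribution
    # metadata in the training sequence but mask it out of the loss, so the
    # model carries it along without really being trained on it.
    import torch

    def build_labels(token_ids, metadata_spans):
        """Copy the input as labels, then blank out metadata spans with -100,
        the default ignore_index of torch.nn.CrossEntropyLoss."""
        labels = torch.tensor(token_ids, dtype=torch.long)
        for start, end in metadata_spans:   # [start, end) positions holding metadata
            labels[start:end] = -100        # ignored by the loss, still visible as input
        return labels

    # Toy example: tokens 0-4 are an attribution header, the rest is the text.
    input_ids = [101, 7592, 2088, 2003, 102, 2023, 2003, 1996, 2434, 3793]
    print(build_labels(input_ids, metadata_spans=[(0, 5)]))
    # tensor([-100, -100, -100, -100, -100, 2023, 2003, 1996, 2434, 3793])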

Xelynega 3 hours ago | parent [-]

If we really want to be technical, in common law systems anything is legal as long as the highest court to rule on it decides it's legal.

I guess I should have used the phrase "common sense stealing in any other context" to be more precise?

asdff 6 hours ago | parent | prev | next [-]

I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore, and it happened to have protected content in it from before the ruling.” Then you’ve essentially won all of humanity’s output, since you can already scrape the new primary information (scientific articles and other datasets designed for researchers to freely access), and whatever junk the content mills put out is just going to be a poor summarization of that primary information.

Another factor that helps make this combination of an old model plus new public-facing data complete is that other forms of media, like storytelling and music, have already converged onto certain prevailing patterns. For stories we expect a certain style of plot development and complain when it’s missing or not what we expect. For music, most of what’s being listened to is lyrics no one is reading deeply into, put over the same old chord progressions we’ve always had. For art, there are just too few of us who actually go out of our way to get familiar with novel art, versus the vast bulk of the world’s present-day artistic effort, which goes towards product advertisement, which once again follows certain patterns people have been publishing in psychological journals for decades now.

In a sense we’ve already put out enough data and made enough of our world formulaic that I believe we’ve already set up a perfect singularity in terms of what can be generated for the average person who looks at a screen today. And because of that, I think even a complete lack of new training on such content wouldn’t hurt OpenAI at all.

zozbot234 7 hours ago | parent | prev | next [-]

They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good faith intent.
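
(As a rough sketch of what (2) could look like at the data level, with entirely made-up field names: each training chunk would carry its provenance so an attribution layer could look it up later.)

    # Rough sketch with made-up field names: each training chunk carries its
    # provenance so attribution could, in principle, be surfaced later.
    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class AttributedChunk:
        text: str        # the training text itself
        source_url: str  # where it was collected from
        author: str      # byline / rights holder, if known
        license: str     # e.g. "public-domain", "CC-BY-4.0", "all-rights-reserved"

    chunk = AttributedChunk(
        text="Example passage used for training.",
        source_url="https://example.com/article",
        author="Jane Doe",
        license="CC-BY-4.0",
    )
    print(json.dumps(asdict(chunk), indent=2))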

CaptainFever 7 hours ago | parent [-]

The latter one is possible with RAG solutions like ChatGPT Search, which do already provide sources! :)

But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed to, IMO. (Attribution: Humanity)

But who knows. Maybe it can be done for more fact-like stuff.

TeMPOraL 6 hours ago | parent [-]

> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.

All of that and more, all at the same time.

Attribution at inference level is bound to work more or less the same way humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.

RAG obviously can work for this, as well as other solutions involving retrieving, finding or confirming sources. That's just like when a human actually looks up the source when citing something - and has similar caveats and costs.
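
(A toy illustration of that "look up the source when citing" step, with simple word overlap standing in for real embedding-based retrieval; the corpus and sources are obviously made up.)

    # Toy RAG-style attribution: find the stored passage closest to the claim
    # and return its source along with the text, so the answer can cite it.
    CORPUS = [
        {"source": "Example Encyclopedia, 'Photosynthesis'",
         "text": "Photosynthesis converts light energy into chemical energy in plants."},
        {"source": "Example News, 2023-05-01",
         "text": "The company reported record revenue in the first quarter."},
    ]

    def retrieve_with_source(query):
        """Return the corpus entry sharing the most words with the query."""
        q = set(query.lower().split())
        return max(CORPUS, key=lambda doc: len(q & set(doc["text"].lower().split())))

    hit = retrieve_with_source("How do plants convert light into energy?")
    print(f'As {hit["source"]} puts it: "{hit["text"]}"')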

TeMPOraL 6 hours ago | parent | prev [-]

Only half-serious, but: I wonder if they can dance with the publishers around this issue long enough for most of the contested text to become part of public court records, and then claim they're now training off that. <trollface>

jprete 4 hours ago | parent [-]

Being part of a public court record doesn't seem like something that would invalidate copyright.