Dylan16807 3 days ago

> In training, the model is trained to predict the exact sequence of words of a text. In other words, it is reproducing the text repeatedly for its own trainings.

That's called extreme overfitting. Proper training is supposed to give subtle nudges toward matching each source of text, and zillions of nudges slowly bring the whole thing into shape based on overall statistics, not any particular source. (But that does require properly removing duplicate copies of very popular text, which seems to be an unsolved problem.)
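For concreteness, here's a minimal sketch of one such nudge (toy PyTorch; a linear layer stands in for the transformer and the vocab size is made up, but the cross-entropy signal is the real one):

    import torch
    import torch.nn.functional as F

    vocab_size = 1000
    model = torch.nn.Linear(vocab_size, vocab_size)  # stand-in for a transformer
    opt = torch.optim.SGD(model.parameters(), lr=1e-4)

    def nudge(prev_id: int, next_id: int):
        x = F.one_hot(torch.tensor([prev_id]), vocab_size).float()
        logits = model(x)                  # a score for every token in the vocab
        loss = F.cross_entropy(logits, torch.tensor([next_id]))
        loss.backward()                    # gradient = direction of the nudge
        opt.step()                         # one small step, not memorization
        opt.zero_grad()

Each occurrence of a text contributes one small step like this; verbatim recall emerges when near-identical copies get stepped on over and over, which is why deduplication matters.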

So your analogy is far enough off that I can't give it a good reply.

> It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style.

I haven't seen anyone defend the piracy, and the piracy is what this settlement is about.

People are defending the training itself.

And I don't think anyone would seriously say the AI version is fair use but the human version isn't. You really think "many people" feel that way?

Retric 3 days ago | parent [-]

There isn’t a clear line for extreme overfitting here.

To generate working code, the output must follow the API exactly. Nothing separates code from natural language as far as the underlying algorithm is concerned.

Companies slightly randomize output to minimize the likelihood of direct reproduction of source material, but that’s independent of what the neural network is doing.

Dylan16807 3 days ago | parent [-]

You want different levels of fitting for different things, which is difficult. Tight fitting on grammar, APIs, and idioms; loose fitting on creative text; and it's hard to classify it all up front. But still, if it can recite Harry Potter, that's not on purpose, and it's never trained to predict a specific source losslessly.

And it's not really about randomizing output. The model gives you a list of likely words, often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it.
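A toy illustration with made-up numbers:

    import numpy as np

    # Hypothetical next-token logits after some prompt -- no clear winner.
    candidates = ["Paris", "France", "the", "a"]
    logits = np.array([2.1, 2.0, 0.4, 0.1])

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()         # softmax: roughly [0.45, 0.41, 0.08, 0.06]

    # There's no hidden "real" output being obfuscated; this distribution
    # IS the output, and some rule has to collapse it to one token.
    print(np.random.choice(candidates, p=probs))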

Retric 3 days ago | parent [-]

> often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it.

It’s very rare for multiple outputs to actually be tied, leaving random choice as the only option. Instead, it’s become accepted practice to make suboptimal choices for a few reasons, one of which really is to decrease the likelihood of reproducing existing text.

Nobody wants a headline like: “Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book” https://www.understandingai.org/p/metas-llama-31-can-recall-...
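For what it's worth, the usual knob for those deliberately suboptimal choices is sampling temperature (a sketch of the standard technique, not any vendor's actual decoding stack):

    import numpy as np

    def sample(logits, temperature=0.8):
        # temperature < 1 sharpens toward the top token, > 1 flattens.
        # Any temperature > 0 means each step has some chance of leaving
        # a memorized sequence, so a long verbatim run becomes far less
        # likely than under greedy (always-argmax) decoding.
        z = np.asarray(logits, dtype=float) / temperature
        p = np.exp(z - z.max())
        p /= p.sum()
        return np.random.choice(len(p), p=p)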

Dylan16807 3 days ago | parent [-]

I will say that picking the most likely word every single time isn't optimal.
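A toy illustration of why (made-up bigram table): always taking the argmax can lock into a loop.

    # Most likely next word after each word, per some toy model (made up).
    argmax_next = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

    word, out = "the", []
    for _ in range(9):
        out.append(word)
        word = argmax_next[word]
    print(" ".join(out))   # "the cat sat on the cat sat on the" -- it cycles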

Retric 3 days ago | parent [-]

I agree there are multiple reasons to slightly randomize output, but there are also downsides.