DiabloD3 3 days ago

No, I'm quite aware of how LLMs work. They are statistical models. They have, however, already been caught reproducing training material verbatim. There is inherently no way to stop that when the only training data behind a given output is a small set of inputs. LLMs can and do exhibit extreme overfitting.
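
To make the overfitting point concrete, here is a minimal sketch in Python (a toy bigram model over an invented passage, nothing like a real LLM): when the only training data for each context is a single continuation, greedy decoding plays the training text back verbatim.

    from collections import defaultdict

    # Toy "training set": a single passage is the only data for every context.
    corpus = "we hold these truths to be self evident".split()

    # Count bigram continuations: the simplest possible next-token model.
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def greedy_generate(start, steps):
        """Always emit the most likely next token."""
        out = [start]
        for _ in range(steps):
            options = counts.get(out[-1])
            if not options:
                break
            out.append(max(options, key=options.get))
        return " ".join(out)

    # Every context has exactly one continuation, so the model has memorized
    # the passage: the output is the training data, word for word.
    print(greedy_generate("we", 7))
    # -> we hold these truths to be self evident

Real LLMs are vastly larger, but the failure mode is the same in kind: where the training distribution for a context is dominated by one source, the most likely continuation is that source.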

As for the Anthropic lawsuit, the piracy part of the case is continuing; most models are built from pirated or unlicensed inputs. The part that was decided, although the decision imo was wrong, only covers whether someone CAN train a model.

At no point have I claimed you can't train one. The question is whether you can distribute one, and then use one. An LLM is not simplistic enough to be considered a phonebook, so that question can't just be handwaved away.

Saying an LLM can do that is like saying an artist can make a JPEG of a Batman symbol, and it's totally okay for them to distribute it because the JPEG artifacts are transformative. LLMs are ultimately just a clever way of compressing data, and compressors are not transformative under the law. Possessing a compressor is not inherently illegal, though, nor is using one on copyrighted material for your own personal use.
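
The compression framing isn't just rhetoric; it's textbook information theory. Any model that assigns probabilities to tokens defines a compressor (via an arithmetic coder), and the size a text compresses to under the model is the model's total negative log-probability on it. A sketch of that identity, with invented probabilities rather than a real model's:

    import math

    def compressed_size_bits(token_probs):
        """Shannon code length: bits to encode tokens at these probabilities."""
        return sum(-math.log2(p) for p in token_probs)

    # A model that has memorized a passage predicts it with near certainty...
    memorized = [0.99] * 20   # 20 tokens, each assigned p = 0.99 (illustrative)
    # ...while a model that merely generalizes spreads probability around.
    generic = [0.05] * 20     # the same 20 tokens at p = 0.05 each

    print(f"memorized passage: {compressed_size_bits(memorized):.1f} bits")  # ~0.3
    print(f"ordinary text:     {compressed_size_bits(generic):.1f} bits")    # ~86.4

A passage that compresses to a fraction of a bit per token is, for practical purposes, stored inside the model.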

Workaccount2 2 days ago | parent

They will just put a dumb copyright filter on the output, a la YouTube or other hosting services.
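
A "dumb filter" in this sense would be something like the sketch below: index protected text by word n-grams and refuse any output that shares a long verbatim run with it. The corpus, threshold, and names here are invented for illustration, not any real service's implementation.

    N = 8  # block on any 8-word verbatim overlap (threshold is arbitrary)

    def ngrams(text, n=N):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    # Index of protected text (a public-domain stand-in here).
    protected_index = ngrams(
        "call me ishmael some years ago never mind how long precisely "
        "having little or no money in my purse"
    )

    def passes_filter(output):
        """Return False if the output shares a long verbatim run with the index."""
        return not (ngrams(output) & protected_index)

    print(passes_filter("call me ishmael some years ago never mind how long"))  # False
    print(passes_filter("a wholly original sentence with no overlap at all"))   # True

Which is also exactly why such filters are weak: a paraphrase, a synonym swap, or even odd whitespace sails straight past an exact n-gram match.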

Again, it's illegal for artists to reproduce copyrighted work; it's not illegal for them to see it or know it. It's not like you can't hire a guy because he can perfectly visualize Pikachu in his head.

Conflating training on copyrighted material with distributing it is disingenuous, and thankfully the courts so far recognize that.

DiabloD3 2 days ago | parent

YouTube et al.'s copyright detection is mostly nonfunctional. It can only match essentially identical input, with very little leeway. Even resizing a video to the wrong aspect ratio, or shifting the audio sample rate too far, fucks up the detection.
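
That brittleness is easy to demonstrate: any fingerprint computed over exact bytes changes completely under a transform a human can't even hear. A sketch with a synthetic tone (the sample rates are arbitrary; real systems use perceptual fingerprints, which tolerate more, but still not much):

    import hashlib
    import math

    def sine_bytes(freq_hz, sample_rate, seconds=1):
        """Render a sine tone as raw 8-bit samples."""
        return bytes(
            int(127 + 120 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
            for i in range(sample_rate * seconds)
        )

    # The same 440 Hz tone rendered at two nearly identical sample rates.
    original  = sine_bytes(440, 44100)
    resampled = sine_bytes(440, 44000)  # perceptually indistinguishable

    # An exact fingerprint treats them as two unrelated files.
    print(hashlib.sha256(original).hexdigest()[:16])
    print(hashlib.sha256(resampled).hexdigest()[:16])  # completely different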

It's illegal for artists to distribute recreated copyrighted work in a way that is not transformative. It isn't illegal to produce it and keep it to themselves.

People also distribute models; they don't merely offer them as a service. However, if someone asks their model to produce a copyright violation and it does so, then the person who created and distributed the model (it's the distribution that is the problem), the service that ran it (assuming it isn't local inference), and the person who asked for the violation to be created can all be looped into the legal case.

This has happened before, well before the world of AI. Even companies that fully participated in the copyright regime, quickly performed takedowns, and ran copyright detection to the best of their ability were sued and lost, because their users committed copyright violations using their services, even though the companies did everything right and entirely above board.

The law is stacked against service providers on the Internet, as it essentially requires them to be omniscient and omnipotent. No such requirements are levied on service providers in other industries.