> they've stolen a mountain of information

In law, training is not itself theft. Pirating books for any reason, including training, is still a copyright violation, but the judges ruled specifically that training on lawfully obtained data was not itself an offence.

Cloudflare has to block so many more bots now precisely because crawling the public, free-to-everyone internet is legally not theft. (And indeed it would struggle to be, given that all search engines have been doing just that for a long time.)

> As the arms race continues AI DDoS bots will have less and less recent "training" material

My experience as a human is that humans keep re-inventing the wheel, and that if we instead re-read the solutions from even just 5 years earlier (or 10, or 15, or 20…), we'd have simpler code and tools that already did all we wanted.

For example, "making a UI" peaked sometime between the late 90s and mid 2010s with WYSIWYG tools like Visual Basic (and the Mac equivalent now known as Xojo) and Dreamweaver, and then, in the final part of that period, a few good years when Interface Builder in Xcode finally didn't suck. And then everyone on the web went for React, and Apple made SwiftUI with a preview mode that kept crashing.

If LLMs had come before reactive UI, we'd have non-reactive alternatives that would probably suck less than all the weird things I keep seeing from reactive UIs.

▲ Anamon 3 days ago | parent [-]

> Cloudflare has to block so many more bots now precisely because crawling the public, free-to-everyone internet is legally not theft.

That is simply not true. Freely available on the web doesn't mean it's in the Public Domain. The "lawfully obtained" part of your argument is patently untrue. You can legally obtain something, but that doesn't mean any use of it is automatically legal as well. Otherwise, the recent Spotify dump by Anna's Archive would be legal as well.

It all depends on the license the thing is released under, chosen by the person who made it freely accessible on the web. This license is still very emphatically a legally binding document that restricts what someone can do with it.

For instance, since the advent of LLM crawling, I've added the "No Derivatives" clause to the CC license of anything new I publish to the web. It's still freely accessible, can be shared on, etc., but it explicitly prohibits using it for training ML models. I even add an additional clause to that effect, should the legal interpretation of CC-ND ever change. In short, anyone training an LLM on my content is infringing my rights, period.

▲ ben_w 3 days ago | parent [-]

> Freely available on the web doesn't mean it's in the Public Domain.

Doesn't need to be.

> The "lawfully obtained" part of your argument is patently untrue. You can legally obtain something, but that doesn't mean any use of it is automatically legal as well.

I didn't say "any" use, I said this specific use. Here's the quote from the judge who decided this:

`5. OVERALL ANALYSIS. After the four factors and any others deemed relevant are “explored, [ ] the results [are] weighed together, in light of the purposes of copyright.” Campbell, 510 U.S. at 578. The copies used to train specific LLMs were justified as a fair use. Every factor but the nature of the copyrighted work favors this result. The technology at issue was among the most transformative many of us will see in our lifetimes.`

- https://storage.courtlistener.com/recap/gov.uscourts.cand.43...

> Otherwise, the recent Spotify dump by Anna's Archive would be legal as well.

I specifically said copyright infringement was separate. Because, guess what, so did the judge the next paragraph but one from the quote I just gave you.

> For instance, since the advent of LLM crawling, I've added the "No Derivatives" clause to the CC license of anything new I publish to the web. It's still freely accessible, can be shared on, etc., but it explicitly prohibits using it for training ML models. I even add an additional clause to that effect, should the legal interpretation of CC-ND ever change. In short, anyone training an LLM on my content is infringing my rights, period.

It will be interesting to see if that holds up in future court cases. I wouldn't bank on it if I was you.