"Reading stuff freely posted on the internet" constitutes stealing now?

Seems like an excessively draconian interpretation of property rights.

▲ michaelmior 4 days ago | parent | next [-]

"Reading stuff freely posted on the internet" is also very different from a business having machines consume large volumes of data posted on the Internet for the purpose of generating value for them without compensating the creators. I'm not making a value judgement one way or the other, but "reading stuff freely posted on the Internet" is an oversimplification.

▲ marssaxman 4 days ago | parent | next [-]

Okay, but "stealing" is also an oversimplification, to the point of absurdity.

It makes no sense to put stuff up on the internet where it can freely be downloaded by anyone at any time, by people who are then free to do whatever they like with it on their own hardware, then complain that people have downloaded that stuff and done what they liked with it on their own hardware.

"Having machines consume large volumes of data posted on the Internet for the purpose of generating value for them without compensating the creators" is equally a description of Google.

▲ ehnto 4 days ago | parent | next [-]

They are not free to do whatever they like; there are tomes of laws across all countries governing what someone can and cannot do with your intellectual property. It's a shame that we didn't have the foresight to add an "if by chance in the future someone invents artificial intelligence, that's not fair use" clause, but that doesn't make what these companies are doing ethical or moral.

I don't disagree regarding Google, I also think they exploited others' IP for their own gain. It was once symbiotic with webmasters, but when that stopped they broke that implied good faith contract. In a sense, their snippets and widgets using others' IP and no longer providing traffic to the site was the warning shot for where we are now. We should have been modernising IP laws back then.

▲ marssaxman 4 days ago | parent [-]

I did say "free to do whatever they like on their own hardware", because intellectual property laws generally govern the transfer of such property rather than the use.

After seeing the harm done by the expansion of patent law to cover software algorithms, and the relentless abuse done under the DMCA, I am reflexively skeptical of any effort to expand intellectual property concepts.

▲ godelski 4 days ago | parent [-]

  > on their own hardware

That doesn't make it technically legal. That only makes it not worth pursuing. You can sue Joe Schmoe for a million dollars but if he doesn't have that then you're not getting a dime. But if Joe Schmoe is using that thing to make money, well then... yeah you bet your ass that's a different situation and the "worth" of pursuing is directly proportional to how much he is making. Doesn't matter if it is his own hardware or not.

Like why do you think who owns the hardware even matters? Do you really think the legality changes if I rent a GPU vs use my own? That doesn't make any sense.

▲ marssaxman 4 days ago | parent [-]

In terms of copyright law, it matters very much whether Joe Schmoe is using his own copy of the data for his own purposes, or whether he is making more copies and distributing them to other people.

If the AI companies were letting people download copies of their training data, copyright law would certainly have something to say about that. But no: once they download the training data, they keep it, and they don't share it.

▲ godelski 4 days ago | parent [-]

  > using his own copy of the data

Yes? That is a different thing? I guess we can keep moving the topic until we're talking about the same topic if you want. But honestly, I don't want to have that kind of conversation.

▲ marssaxman 4 days ago | parent | next [-]

How is it a different thing? Are we talking about copyright law, or not?

▲ godelski 3 days ago | parent [-]

Before you were talking about data you don't own on hardware you do. Now you're talking about data you do own.

The whole thing is about who owns the data!

▲ marssaxman 3 days ago | parent [-]

I can imagine that it would indeed be confusing if you failed to distinguish between ownership of the data and ownership of the copyright.

▲ godelski 3 days ago | parent [-]

Sure... now go back to your edgy comment and keep this in mind to see why everyone is arguing with you

  >>...> To be honest, these companies already stole terabytes of data and don't even disclose their dataset, so you have to assume they'll steal and train at anything you throw at them

  >...> "Reading stuff freely posted on the internet" constitutes stealing now?

Literally everyone was talking about data ownership and you just said "I can download it, so it is fair game on my hardware."

Let's say you didn't intend to say that. Well that doesn't matter, that's what a lot of people heard and you failed to clarify when pressed on this.

So yeah, I think you're doing gymnastics

https://news.ycombinator.com/item?id=45066376

▲ derangedHorse 3 days ago | parent | prev [-]

It doesn’t seem like anyone is moving topics here. Where do you see the topic being moved?

▲ godelski 3 days ago | parent [-]

"His own hardware" != "his own copy of the data"

My entire comment was that the entire issue is about data ownership. Doesn't even matter if you have a copy of the data. It matters how that copy was obtained. There's no reason to then discuss if your usage violates the terms of a license if you obtained the data illegally. You're already in the illegal territory lol.

Having data != legally having obtained data

▲ schwartzworld 4 days ago | parent | prev | next [-]

What if that data isn’t publicly posted? For example, copilot regurgitating code from private repos, complete with comments.

▲ sobkas 4 days ago | parent | prev | next [-]

Proper term for it is Computer Assisted Plagiarism, CAP for short. Also, I really hope that Google doesn't claim it created the sites it crawls for its search engine.

▲ thrwaway55 3 days ago | parent | prev | next [-]

Ok so if I publish under a license saying I don't allow for it to be used for AI do you believe they respect it? What word would you use to describe this violation? Go ahead throw up a robots.txt, throw up a license. You will be able to coax the "fair use" stochastic parrots to render it verbatim.

Sam Altman and his ilk are exploiting the incredibly slow moving legal system to enrich themselves.

▲ heavyset_go 11 hours ago | parent [-]

> Ok so if I publish under a license saying I don't allow for it to be used for AI do you believe they respect it?

It's even worse than that, they don't even legally have to respect it if courts find it to be fair use, and so far they have. If it's fair use to train models on it, your license means nothing. The only way to "win" is to not publish your code at all, anywhere.

▲ nerdponx 4 days ago | parent | prev | next [-]

It's not about the downloading of the data, it's about its use in training models, which is dubious from a copyright perspective.

▲ uncletscollie 4 days ago | parent | prev | next [-]

That is not at all how the internet works. Try to download music from Napster and Lars will sue your ass.

▲ marssaxman 4 days ago | parent [-]

No he certainly will not; you will only get sued if you upload Lars' music to share with other people. If you download an illegal copy, the person you downloaded from is the one breaking the law.

▲ coldtea 3 days ago | parent [-]

You're breaking the law too - just like accepting stolen goods is also breaking the law, not just selling them.

▲ godelski 4 days ago | parent | prev | next [-]

  > where it can freely be downloaded by anyone at any time, by people who are then free to do whatever they like with it on their own hardware

I think you have a strong misunderstanding of the law and the general expectation of others.

I'd like to remind you that a lot of celebrities face legal issues for posting photos of themselves. Here's a recent example with Jennifer Lopez[0]. The reason these types of lawsuits are successful is because it is theft of labor. If you hire a professional photographer to take photos of your wedding then the contract is that the photographer is handing over ownership of the photos in exchange for payment. The only difference here is that the photo was taken before a contract was made. The celebrity owns the right to their body and image, but not to the photograph.

Or think about Open Source Software. Just because it is posted on GitHub does not mean you are legally allowed to use it indiscriminately. GitHub has licenses and not all of them are unrestricted. In fact, a repo without a license does not mean unfettered usage. The default is that the repo owner has the copyright[1].

  > You're under no obligation to choose a license. However, without a license, the default copyright laws apply, meaning that you retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work.

A big part of what will make a lawsuit successful or not is if the owner has been deprived of compensation. As in, if you make money off of someone else's work. That's why this has been the key issue in all these AI lawsuits, where the question is about if the work is transformative or not. All of this is in new legal territory because the laws were not written with this usage in mind. The transformative stuff is because you need to allow for parody or referencing. You don't want a situation where, say... someone can't include a video of what the president has said in order to discuss what was said[2]. But this situation is much closer to "Joe stole a book, learned from that book, and made a lot of money through the knowledge that they obtained from this book AND would not have been able to do without the book's help." Just, it's usually easier to go after the theft part of that situation. It's definitely a messy space.

But basically, just because a piece of art exists on public property does not mean you have the right to do whatever you want with it.

  >  is equally a description of Google.

Yes and no. The AI summaries? Yeah. The search engine and linking? No. The latter is a mutually beneficial service. It's one thing to own a taxi service and it is another to offer a taxi service that will walk into a Starbucks, take a random drink off the counter, and deliver it to you. I'm not sure why this is difficult to understand.

[0] https://www.bbc.com/news/articles/cx2qqew643go

[1] https://docs.github.com/en/repositories/managing-your-reposi...

[2] https://www.youtube.com/watch?v=tUnRWh4xOCY

▲ pigeons 4 days ago | parent | prev | next [-]

But they didn't only train on information the creators made freely available. They trained on copyrighted materials obtained illicitly.

▲ pigeons 4 days ago | parent [-]

I know we're not supposed to comment about downvotes, but the original comment was talking about "these companies", and none of the information indicating that they, or at the very least Meta, trained on terabytes of books downloaded from zlib and libgen and other torrent sites, is in dispute. So even if you believe that copyright should not exist, I don't see why this is not a valid dispute of the parent's argument that they only trained on information creators made freely available.

▲ vunderba 4 days ago | parent | prev [-]

> "Having machines consume large volumes of data posted on the Internet for the purpose of generating value for them without compensating the creators" is equally a description of Google.

Quid pro quo. Those sites also received traffic from the audiences searching using Google. "Without compensation" really only became a thing when Google started adding the inlined cards which distilled the site's content thus obviating the need for a user to visit the aforementioned site.

▲ godelski 4 days ago | parent | next [-]

I'm not sure quid pro quo even matters. A search engine is more like providing a taxi service. You're just taking people to a place.

Now the AI summaries are a different story. One where there is no quid pro quo either. It's different when that taxi service will also offer the same service as that business. It's VERY different when that taxi service will walk into that business, take their services free of charge[0], and then transfer that to the taxi customer.

[0] Scraping isn't going to offer ad revenues

[Side note] In our analogy the little text below the link is more like the taxi service offering some advertising or some description of the business. Bit more gray here but I think the quid pro quo phrase applies here. Taxi does this to help customer find the right place to go, providing the business more customers. But the taxi isn't (usually) replacing the service itself.

▲ derangedHorse 3 days ago | parent | prev [-]

Arguments like this never work out. There is no agreed upon compensation for being listed. If I didn’t want my site listed by Google and it was listed anyway, I may not think the traffic justifies my subjective “cost” of being listed. There’s also no legal protection against having my publicly accessible site and the title in its html shown (as there shouldn’t be).

▲ bdamm 4 days ago | parent | prev | next [-]

We didn't seem to mind when Google was doing it back in 1999, or Lycos, Altavista, etc before them... why do we care about the LLM companies doing it now?

▲ codazoda 4 days ago | parent | next [-]

I find LLMs extremely useful but I think the difference is that they regurgitate the content (not verbatim) instead of a link to it. This is not unlike how a human might tell their friend about it.

▲ bdamm 4 days ago | parent | next [-]

Google has been regurgitating content right into search results since the very beginning, and they've been providing "synopsis" type of results for over a decade.

▲ Nevermark 4 days ago | parent | prev [-]

> This is not unlike how a human might tell their friend about it.

Is there someone who has read the whole internet? Can we all be their friend?

The entire basis of fair use is scale matters.

▲ nbulka 4 days ago | parent | prev [-]

Because they have terms of service they have to adhere to. We need laws to be lawful.

▲ senko 4 days ago | parent | prev | next [-]

I consumed large volumes of data posted on the internet for decades, which generated a lot of value for me, without compensating the creators.

The only difference is that I (presumably) have a soul.

▲ gist 4 days ago | parent | prev | next [-]

> "Reading stuff freely posted on the internet" is also very different from a business having machines consume large volumes of data posted on the Internet for the purpose of generating value for them without compensating the creators.

The fact that value is being created is irrelevant. The fact that they are making profit is irrelevant. As is non compensation to creators. There isn't any law being broken. Is there?

Bottom line in real world terms there is no expectation of privacy with a freely open and unrestricted web site. Even if that website said 'you can use this for single use but not mass use' that in itself is not legally or practically enforceable.

Let's take the example of a Christmas light show. The idea might be (in the homeowner's mind) that people, families, will drive by in their cars to enjoy the light show (either a single home or the entire street or most of it). They might think 'we don't want buses full of people who paid to ride the bus' coming down the street. Unfortunately there is no way to prevent that (without the city and laws getting involved) and there is nothing wrong with the fact that the people who provide the bus are making money bringing people to see the light show.

▲ jMyles 4 days ago | parent | prev [-]

> "Reading stuff freely posted on the internet" is also very different from a business having machines consume large volumes of data

...not if you believe in the right of general-purpose computing. If they have the right to read the data, why don't they have a right to program a computer to do it for them?

I think we all agree that they're not the good guys here, but this reasoning in particular is troubling.

▲ TheRoque 4 days ago | parent | prev | next [-]

I'm not talking about that, I'm talking about downloading gigabytes of books, and movies and who knows what data (since it's not disclosed) without paying. Those are not freely posted on the internet. Well, not legally anyways.

▲ Sohcahtoa82 4 days ago | parent | prev | next [-]

This is a quintessential bad faith comment.

The reference to terabytes of stolen data refers to copyrighted material. I think you know this but chose to frame it as "stuff freely posted on the internet" in order to mislead and strawman the other comment.

▲ marssaxman 4 days ago | parent [-]

I meant it exactly as I said it. I do not agree that any theft occurred, either in law or in spirit, and I believe that reinterpretation of intellectual-property law in order to make it a crime would cause significant harm, greatly outweighing the benefits, as has been the case with every other expansion of intellectual property law I have seen.

▲ fcarraldo 4 days ago | parent [-]

Anthropic downloaded books from Library Genesis and The Pirate Library mirror. This is factual and reported on from court documents.

What’s the angle that describes this as fair use?

[0] https://www.businessinsider.com/anthropic-cut-pirated-millio...

▲ marssaxman 4 days ago | parent [-]

The simple fact that they are not republishing any of that data. Fair use does not apply, because copyright does not apply, because nothing is being copied.

▲ Wowfunhappy 4 days ago | parent [-]

So you don't think downloading something from The Pirate Bay constitutes copyright infringement provided you don't republish it?

▲ marssaxman 4 days ago | parent [-]

Precisely. The person sharing is the one breaking the law.

▲ thrwaway55 3 days ago | parent | next [-]

I just want to confirm this, you believe that when OpenAI and their agents post copyrighted material that they did not pay for verbatim it is breaking the law?

▲ coldtea 3 days ago | parent | prev | next [-]

You are wrong then. Confidently wrong.

U.S.: Downloading = infringement. If prosecuted, usually gets civil lawsuits/fines, not jail.

E.U.: Same. Both downloading and hosting are illegal, but hosts get cracked down on harder.

▲ TheRoque 4 days ago | parent | prev [-]

That's factually wrong, downloading without sharing is also illegal.

▲ themafia 4 days ago | parent | prev | next [-]

Faithfully reproducing something you've previously read while passing it off as your own original work is a violation of the most basic tenets of intellectual property rights.

▲ WA 4 days ago | parent | prev | next [-]

Forgot the 82TB of torrented books Meta has been using for training? I mean, yeah, it’s Meta. No surprise. But I won’t believe for one second that the other players didn’t do a similar thing. They just haven’t been caught yet.

▲ exe34 4 days ago | parent | prev | next [-]

so I can take a screenshot from a movie trailer on YouTube and sell posters of it now? I thought copyright still applied to the poor.

▲ 4 days ago | parent | prev | next [-]

[deleted]

▲ coldtea 3 days ago | parent | prev | next [-]

"Reading stuff freely posted on the internet" that is copyrighted, in order to use it in your generative AI service, is stealing. That is a pretty basic interpretation of property rights.

▲ timeon 4 days ago | parent | prev | next [-]

What "reading"?

▲ marssaxman 4 days ago | parent | next [-]

The same reading search engine crawlers have been doing since time immemorial.

▲ ehnto 4 days ago | parent | next [-]

No one gave them permission to access their webservers back then either. Before it's cited that there is precedent in law, that is in the US. No such precedent exists in my country, and our laws suggest that unauthorized access regardless of "gates up or down" would constitute trespassing. There are also no protections for scrapers coming out of prior lawsuits, and copying copyrighted material is of course illegal.

Which is just to point out that the world wide web is not its own jurisdiction, and I believe AI companies are going to be finding that an ongoing problem. Unlike search, there is no symbiosis here, so there is an incentive to sue. The original IP holders do not benefit in any way. Search was different in that way.

▲ TheRoque 4 days ago | parent | prev [-]

Search engines never claimed that their content was original, and redirect to the original author (which gets proper retribution)

▲ kridsdale1 4 days ago | parent | prev [-]

Looking at and gaining knowledge.

▲ estimator7292 3 days ago | parent | prev [-]

As long as people are being prosecuted for piracy or having their livelihoods compromised for including a 16 second clip of a song, yes.