| ▲ | fny 8 months ago |
| No. OpenAI is estimated to be worth over $150B. They can absolutely afford to pay people for data. Edit: People commenting need to understand that $150B is the discounted value of future revenues. So... yes they can pay out... yes they will be worth less... and yes that's fair to the people who created the information. I can't believe there are so many apologists on HN for what amounts to vacuuming up people's data for financial gain. |
|
| ▲ | jsheard 8 months ago | parent | next [-] |
| The OpenAI that is assumed to be able to keep harvesting every form of IP without compensation is valued at $150B; an OpenAI that has to pay for data would be worth significantly less. They're currently not even expecting to turn a profit until 2029, and that's without paying for data. https://finance.yahoo.com/news/report-reveals-openais-44-bil... |
|
| ▲ | suby 8 months ago | parent | prev | next [-] |
| OpenAI is not profitable, and to achieve what they have achieved they had to scrape basically the entire internet. I can easily believe that OpenAI could not exist if it had to respect copyright. https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t... |
| |
| ▲ | noitpmeder 8 months ago | parent | next [-] | | That's a good thing! If a company cannot rise to prominence unless it violates laws, it should not exist. There is plenty of public domain text that could have taught an LLM English. | | |
| ▲ | suby 7 months ago | parent | next [-] | | I'm not convinced that the economic harm to content creators is greater than the productivity gains and accessibility of knowledge for users (relative to how competent the models would be if trained only on public domain text). Personally, I derive immense value from ChatGPT / Claude. It's borderline life changing for me. As time goes on, I imagine it'll increasingly be the case that these LLMs displace people from their jobs and careers. I don't know whether the harm done will be greater than the benefit to society. I'm sure the answer will depend on who you ask. > That's a good thing! If a company cannot rise to prominence unless it violates laws, it should not exist. Obviously, given what I wrote above, I'd consider it a bad thing if LLM tech severely regressed due to copyright law. Laws are not inherently good or bad. I think you can make a good argument that this tech will be a net negative for society, but I don't think it's valid to do so just on the basis that it breaks the law as it stands today. | | |
| ▲ | DrillShopper 7 months ago | parent [-] | | > I'm not convinced that the economic harm to content creators is greater than the productivity gains and accessibility of knowledge for users (relative to how competent the models would be if trained only on public domain text). Good thing whether or not something is a copyright violation doesn't depend on whether you can make more money with someone else's work than they can. | | |
| ▲ | suby 7 months ago | parent [-] | | I understand the anger about large tech companies using others' work without compensation, especially when both they and their users benefit financially. But this goes beyond economics. LLM tech could accelerate advances in medicine and technology. I strongly believe that we're going to see societal benefits in education, healthcare, and especially mental health support thanks to this tech. I also think that someone making money off LLMs is a separate question from whether or not the original creator has been harmed. I think many creators are going to benefit from better tools, and we'll likely see new forms of creation become viable. We already recognize that certain uses of intellectual property should be permitted for society's benefit. We have the fair use doctrine, compulsory patent licensing for public health, research exemptions, and public libraries. Transformative use is also permitted, and LLMs are inherently transformative. Look at the volume of data they ingest compared to the final size of a trained model, and how fundamentally different the output format is from the input data. Human progress has always built upon existing knowledge. Consider how Darwin and Wallace independently developed the theory of evolution at roughly the same time -- not in isolation, but by building on the intellectual foundation of their era. Everything in human culture builds on what came before. That all being said, I'm also sure that this tech is going to negatively impact people. Like I said in the other reply, whether this tech is good or bad will depend on who you ask. I just think that we should weigh these costs against the potential benefits to society as a whole rather than simply preserving existing systems, or blindly following the law as if the law were inherently just or good. Copyright law was made before this tech was even imagined, and it seems fair to now evaluate whether the current copyright regime makes sense if it turns out that it would keep us in some local maximum. |
|
| |
| ▲ | YetAnotherNick 7 months ago | parent | prev [-] | | > unless it violates laws *unless it violates a particular country's laws. Which means OpenAI or an alternative could survive in China but not in the US. The question is whether we are fine with that. |
| |
| ▲ | jpalawaga 8 months ago | parent | prev [-] | | Technically OpenAI has respected copyright, except in the (few) instances where they reproduce non-fair-use amounts of copyrighted material. The DMCA does not cover scraping. |
|
|
| ▲ | mrweasel 8 months ago | parent | prev | next [-] |
| That's not real money though. You need actual cash on hand to pay for stuff, and OpenAI only has the money it's been given by investors. I suspect that many of those investors wouldn't have been so keen if they knew that OpenAI would need an additional couple of billion dollars a year to pay for data. |
| |
| ▲ | __loam 7 months ago | parent [-] | | Too bad your business isn't viable without the largest single copyright violation of all time. |
|
|
| ▲ | nickpsecurity 8 months ago | parent | prev | next [-] |
| That doesn’t mean they have $150B to hand over. What you can cite is the $10 billion they got from Microsoft. I’m sure they could use a chunk of that to buy competitive I.P. for both companies to use for training. They can also pay experts to create it. They could even sell that to others for use in smaller models to finance creating or buying even more I.P. for their models. |
|
| ▲ | wvenable 8 months ago | parent | prev [-] |
| [flagged] |
| |
| ▲ | CJefferson 8 months ago | parent | next [-] | | We can, and do, choose to treat normal people differently from billion-dollar companies that are attempting to suck up all human output and turn it into their own profit. If they were, say, a charity doing this for the good of mankind, I'd have more sympathy. Shame they never were. | | |
| ▲ | tolmasky 8 months ago | parent [-] | | The way to treat them differently is not by making them share profits with another corporation. The logical endgame of all this isn't "stopping LLMs"; it's Disney happening to own a critical mass of IP, being able to legally train and run LLMs that make movies, firing all their employees, and no smaller company ever having a chance in hell of competing with a literal century's worth of IP powering a generative model. The best part about all this is that Disney initially took off by... making use of public domain works. Copyright used to last 14 years. You'd have been able to create derivative works of most of the art in your life at some point. Now you're never allowed to. And more often than not, the monopoly is granted not to the "author" but to the corporation that hired them. The correct analysis shouldn't be OpenAI vs. The Intercept or Disney or whomever. You're just choosing kings at that point. |
| |
| ▲ | IsTom 8 months ago | parent | prev | next [-] | | > produced "a unique" song? People do get sued for making songs that are too similar to previously made songs. One defence available is that they had never heard the earlier song before. If you want to treat AI like humans, then when AI output is similar enough to copyrighted material it should get sued. Then you try to prove that it didn't somehow ingest the original version. | |
| ▲ | noitpmeder 8 months ago | parent [-] | | The fact that these lawsuits aren't as simple as "is my copyrighted work in your training set, yes or no" is mind-boggling. | |
| ▲ | __loam 7 months ago | parent [-] | | I feel like at some point the people in favor of this are going to realize that whether the data was ingested into a training set is completely immaterial. These companies downloaded data they had no license to use onto a company server somewhere, with the intention of using it commercially. |
|
| |
| ▲ | GeoAtreides 8 months ago | parent | prev | next [-] | | Ah yes, humans and LLMs are exactly the same, learning the same way, reasoning the same way; they're practically indistinguishable. So that's why it makes sense to equate humans reading books with computer programs ingesting and processing the equivalent of billions of books in a matter of days or months. | |
| ▲ | Timwi 8 months ago | parent [-] | | While I agree with your sentiment in general, this thread is about the legal situation and your argument is unfortunately not a legal one. | | |
| ▲ | anileated 8 months ago | parent [-] | | "A person is fundamentally different from an LLM" does not need a legal argument; it is implied by the fact that LLMs do not have human rights, or even anything comparable to animal rights. A legal argument would be needed to argue the other way around. Such an argument would imply granting LLMs some degree of human rights, which the very industry profiting from these copyright violations will never let happen, for obvious reasons. | | |
| ▲ | notahacker 8 months ago | parent [-] | | The other problem with the legal argument that it's "just like a person learning" is that corporations whose human employees have learned what copyrighted characters look like and then start incorporating them into their art are considered guilty of copyright violation, and don't get to deploy the "it's not an intentional copyright violation from someone who should have known better, it's just a tool outputting what the user requested" defence... | | |
| ▲ | anileated 7 months ago | parent [-] | | Exactly. Also, human employees have free will and agency: it is only a matter of time before one of them blows the whistle, it doesn't scale, and so on. Frankly, the fact that such a big segment of the HN crowd unthinkingly buys big tech's double standard (LLMs are human where copyright is concerned, but not human in any other sense) makes me ashamed of the industry. |
|
|
|
| |
| ▲ | mongol 8 months ago | parent | prev | next [-] | | The process of reading it into their training data is a way of copying it. It exists somewhere and they need to copy it in order to ingest it. | | |
| ▲ | wvenable 8 months ago | parent [-] | | By that logic you're violating copyright by using a web browser. | | |
| ▲ | Suppafly 8 months ago | parent | next [-] | | >By that logic you're violating copyright by using a web browser. You would be except for the fact that publishing stuff on the web gives people an implicit license to download it for the purposes of viewing it. | | |
| ▲ | Timwi 8 months ago | parent | next [-] | | Not sure about US or other jurisdictions, but that's not how any of this works in Germany. In Germany downloading anything from anywhere (even a movie) is never illegal and does not require a license. What's illegal is publishing/disseminating copyrighted content without authorization. BitTorrenting a movie is illegal because you're distributing it to other torrenters. Streaming a movie on your website is illegal because it's public. You can be held liable for using a photo from the web to illustrate your eBay auction, not because you downloaded it but because you republished it. OpenAI (and Google and everyone else) is creating a publicly-accessible system that produces output that could be derived from copyrighted material. | | |
| ▲ | Suppafly 7 months ago | parent | next [-] | | I think it works like that in Canada and some other places too, because people pay an extra levy on storage media when they buy it, which essentially grants a license for any copyrighted material that might be stored on that media. | |
| ▲ | Tomte 7 months ago | parent | prev [-] | | > In Germany […] That's confidently and completely wrong. |
| |
| ▲ | wvenable 7 months ago | parent | prev [-] | | I'm only allowed to view it? I can't download it, convert each word into a color, and create a weird piece of artwork out of it? I think I can. | |
| ▲ | Suppafly 7 months ago | parent [-] | | >convert each word into a color, and create a weird piece of art work out of it? I think I can. I agree, but the original author might get butthurt if you distribute it. Realistically copyright law in the US is a mess when it comes to weird pieces of art. |
|
| |
| ▲ | __loam 7 months ago | parent | prev [-] | | The nature of the copy does actually matter. |
|
| |
| ▲ | DrillShopper 7 months ago | parent | prev | next [-] | | > You read books and now you have a job? Pay up. It is disingenuous to imply that the scale of someone buying books and reading them (for which the publisher and author are compensated), or borrowing them from a library and reading them (again, for which the publisher and author are compensated), is the same as the wholesale copying, without permission or payment, of everything not behind a paywall on the Internet. | |
| ▲ | 8 months ago | parent | prev [-] | | [deleted] |
|