Remix.run Logo
simonw 3 days ago

My confidence comes from the following:

1. There are good, ethical people working at these companies. If you were going to train on customer data that you had promised not to train on there would be plenty of potential whistleblowers.

2. The risk involved in training on customer data that you are contractually obliged not to train on is higher than the value you can get from that training data.

3. Every AI lab knows that the second it comes out that they trained on paying customer data saying they wouldn't, those paying customers will leave for their competitors (and sue them int the bargain.)

4. Customer data isn't actually that valuable for training! Great models come from carefully curated training data, not from just pasting in anything you can get your hands on.

Fundamentally I don't think AI labs are stupid, and training on paid customer data that they've agreed not to train on is a stupid thing to do.

RodgerTheGreat 3 days ago | parent | next [-]

1. The people working for these companies are already demonstrably ethically flexible enough to pirate any publicly accessible training data they can get their hands on, including but not limited to ignoring the license information in every repo on GitHub. I'm not impressed with any of these clowns and I wouldn't trust them to take care of a potted cactus.

2. The risk of using "illegal" training data is irrelevant, because no GenAI vendors have been meaningfully punished for violating copyright yet, and in the current political climate they don't expect to be anytime soon. Even so,

3. Presuming they get caught redhanded using personal data without permission- which, given the nature of LLMs would be extremely challenging for any individual customer to prove definitively- they may lose customers, and customers may try to sue, but you can expect those lawsuits to take years to work their way through the courts; long after these companies IPO, employees get their bag, and it all becomes someone else's problem.

4. The idea of using carefully curated datasets is popular rhetoric, but absolutely does not reflect how the biggest GenAI vendors do business. See (1).

AI labs are extremely shortsighted, sloppy, and demonstrably do not care a single iota about the long term when there's money to be made in the short term. Employees have gigantic financial incentives to ignore internal malfeasance or simple ineptitude. The end result is, if anything, far worse than stupidity.

simonw 3 days ago | parent [-]

There is an important difference between openly training on scraped web data and license-ignored data from GitHub and training on data from your paying customers that you promised you wouldn't train on.

Anthropic had to pay $1.5bn after being caught downloading pirated ebooks.

lunar_mycroft 3 days ago | parent [-]

So Anthropic had to pay less than 1% of their valuation despite approximately their entire business being dependent on this and similar piracy. I somehow doubt their takeaway from that is "let's avoid doing that again".

ben_w 2 days ago | parent | next [-]

Two things:

First: Valuations are based on expected future profits.

For a lot of companies, 1% of valuation is ~20% of annual profit (P/E ratio 5); for fast growing companies, or companies where the market is anticipating growth, it can be a lot higher. Weird outlier example here, but consider that if Tesla was fined 1% of its valuation (1% of 1.5 trillion = 15 billion), that would be most of the last four quarter's profit on https://www.macrotrends.net/stocks/charts/TSLA/tesla/gross-p...

Second: Part of the Anthropic case was that many of the books they trained on were ones they'd purchased and destructively scanned, not just pirated. The courts found this use was fine, and Anthropic had already done this before being ordered to: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...

simonw 3 days ago | parent | prev [-]

Their main takeaway was that they should legally buy paper books, chop the spines off and scan those for training instead.

lunar_mycroft 3 days ago | parent | prev [-]

Every single point you made is contradicted by the observed behavior of the AI labs. If any of those factors were going to stop them from training on data they legally can't, they would have done so already.