Aurornis 3 days ago

> If I am unable to convince you to stop meticulously training the tools of the oppressor (for a fee!) then I just ask you do so quietly.

I'm kind of fascinated by how AI has become such a culture war topic with hyperbole like "tools of the oppressor"

It's equally fascinating how little these comments understand about how LLMs work. Using an LLM for inference (which is what you do when you use Claude Code) does not train the LLM. It does not learn from your code and integrate it into the model while you use it for inference. I know that breaks the "training the tools of the oppressor" narrative, which is probably why it's always ignored. If not ignored, the next step is to decry that the LLM companies are lying and are stealing everyone's code despite saying they don't.
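
To make the distinction concrete: an inference call is a stateless request/response, and nothing about the model's weights changes because you sent it a prompt. A minimal sketch, assuming the official Anthropic Python SDK, an API key in the environment, and an example model ID (substitute any current one):

    import anthropic  # assumes the official Anthropic Python SDK is installed

    # Inference is a stateless call: the prompt goes up, a completion comes back,
    # and no weight update happens as part of the request. Whether the provider
    # later trains on the transcript is a separate data-retention/policy question.
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # example model ID, not a recommendation
        max_tokens=256,
        messages=[{"role": "user", "content": "Explain what a context window is."}],
    )

    print(response.content[0].text)  # the completion; the model itself is unchanged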

meowkit 3 days ago | parent | next [-]

We are not talking about inference.

The prompts and responses are used as training data. Even if your provider allows you to opt out they are still tracking your usage telemetry and using that to gauge performance. If you don’t own the storage and compute then you are training the tools which will be used to oppress you.

Incredibly naive comment.

Aurornis 3 days ago | parent [-]

> The prompts and responses are used as training data.

They show a clear pop-up where you choose whether or not to allow your data to be used for training. If you don't choose to share it, it's not used.

I mean, I guess if someone blindly clicks through everything and hits "Accept" without noticing the very obvious slider to turn it off, they could be caught off guard.

Assuming everyone who uses Claude is training their LLMs is just wrong, though.

Telemetry data isn't going to extract your codebase.

lukan 3 days ago | parent [-]

"If you don't choose to share it, it's not used"

I am curious where your confidence that this is true is coming from?

Besides lots of GPUs, training data seems like the most valuable asset AI companies have. Sounds like a strong incentive to me to secretly use it anyway. Who would really know, if the pipelines are set up in a way that only very few people are aware of this?

And if it comes out: "oh gosh, one of our employees made a mistake."

And they already admitted to training on pirated content. So maybe they learned their lesson ... maybe not, as they are still making money and want to continue to lead the field.

simonw 3 days ago | parent | next [-]

My confidence comes from the following:

1. There are good, ethical people working at these companies. If you were going to train on customer data that you had promised not to train on, there would be plenty of potential whistleblowers.

2. The risk involved in training on customer data that you are contractually obliged not to train on is higher than the value you can get from that training data.

3. Every AI lab knows that the second it comes out that they trained on paying customer data after saying they wouldn't, those paying customers will leave for their competitors (and sue them into the bargain).

4. Customer data isn't actually that valuable for training! Great models come from carefully curated training data, not from just pasting in anything you can get your hands on.

Fundamentally I don't think AI labs are stupid, and training on paid customer data that they've agreed not to train on is a stupid thing to do.

RodgerTheGreat 3 days ago | parent | next [-]

1. The people working for these companies are already demonstrably ethically flexible enough to pirate any publicly accessible training data they can get their hands on, including but not limited to ignoring the license information in every repo on GitHub. I'm not impressed with any of these clowns and I wouldn't trust them to take care of a potted cactus.

2. The risk of using "illegal" training data is irrelevant, because no GenAI vendors have been meaningfully punished for violating copyright yet, and in the current political climate they don't expect to be anytime soon. Even so,

3. Presuming they get caught red-handed using personal data without permission (which, given the nature of LLMs, would be extremely challenging for any individual customer to prove definitively), they may lose customers, and customers may try to sue, but you can expect those lawsuits to take years to work their way through the courts, long after these companies IPO, employees get their bag, and it all becomes someone else's problem.

4. The idea of using carefully curated datasets is popular rhetoric, but absolutely does not reflect how the biggest GenAI vendors do business. See (1).

AI labs are extremely shortsighted, sloppy, and demonstrably do not care a single iota about the long term when there's money to be made in the short term. Employees have gigantic financial incentives to ignore internal malfeasance or simple ineptitude. The end result is, if anything, far worse than stupidity.

simonw 3 days ago | parent [-]

There is an important difference between openly training on scraped web data and license-ignored code from GitHub, and training on data from your paying customers that you promised you wouldn't train on.

Anthropic had to pay $1.5bn after being caught downloading pirated ebooks.

lunar_mycroft 3 days ago | parent [-]

So Anthropic had to pay less than 1% of their valuation despite approximately their entire business being dependent on this and similar piracy. I somehow doubt their takeaway from that is "let's avoid doing that again".

ben_w 2 days ago | parent | next [-]

Two things:

First: Valuations are based on expected future profits.

For a lot of companies, 1% of valuation is ~20% of annual profit (a P/E ratio of 20); for fast-growing companies, or companies where the market is anticipating growth, it can be a lot higher. Weird outlier example here, but consider that if Tesla were fined 1% of its valuation (1% of 1.5 trillion = 15 billion), that would be most of the last four quarters' profit on https://www.macrotrends.net/stocks/charts/TSLA/tesla/gross-p...
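
Quick back-of-the-envelope to show the mechanics (made-up numbers, purely illustrative, not anyone's actual financials):

    # Illustrative only: how big a "1% of valuation" fine is relative to annual profit.
    valuation = 100e9          # hypothetical $100B valuation
    fine = 0.01 * valuation    # 1% of valuation = $1B

    for pe in (5, 20, 100):                 # price-to-earnings ratios
        annual_profit = valuation / pe      # implied annual earnings at that P/E
        print(f"P/E {pe}: fine is {fine / annual_profit:.0%} of annual profit")

    # P/E 5 -> 5%, P/E 20 -> 20%, P/E 100 -> 100%: the richer the valuation is
    # relative to earnings, the bigger the bite such a fine takes out of profit.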

Second: Part of the Anthropic case was that many of the books they trained on were ones they'd purchased and destructively scanned, not just pirated. The courts found this use was fine, and Anthropic had already done this before being ordered to: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...

simonw 3 days ago | parent | prev [-]

Their main takeaway was that they should legally buy paper books, chop the spines off and scan those for training instead.

lunar_mycroft 3 days ago | parent | prev [-]

Every single point you made is contradicted by the observed behavior of the AI labs. If any of those factors were going to stop them from training on data they legally can't use, they would have stopped already.

Aurornis 3 days ago | parent | prev | next [-]

> I am curious where your confidence that this is true is coming from?

My confidence comes from working in big startups and big companies with legal teams. There's no way the entire company is going to gather all of the engineers and everyone around, have them code up a secret system to consume customer data into a secret part of the training set, and then have everyone involved keep quiet about it forever.

The whistleblowing and leaking would happen immediately. We've already seen LLM teams leak and have people try to whistleblow over things that aren't even real, like the Google engineer who thought they had invented AGI a few years ago (lol). OpenAI had a public meltdown when the employees disagreed with Sam Altman's management style.

So my question to you is: What makes you think they would do this? How do you think they'd coordinate the teams to keep it all a secret and only hire people who would take this secret to their grave?

lukan 3 days ago | parent [-]

"There's no way the entire company is going to gather all of the engineers and everyone around, have them code up a secret system "

No, that is why I wrote

"Who would really know, if the pipelines are set up in a way, that only very few people are aware of this?" (Typo fixed)

There is no need for everyone to know. I don't know their processes, but I can think of ways to only include very few people who need to know.

The rest just work on everything else. Some work with data, where they don't need to know where it came from, some on UI, some on scaling up, some on ... none of them need to know that DB XYZ comes from a dark source.

theshrike79 2 days ago | parent | prev | next [-]

> I am curious where your confidence that this is true is coming from?

We have a legally binding contract with Anthropic. Checked and vetted by our lawyers, who are annoying because they actually READ the contracts and won't let us use services with suspicious clauses in them - unless we can make amendments.

If they're found to be in breach of said contract (which is what every paid user of Claude signs), Anthropic is going to be the target of SO FUCKING MANY lawsuits even the infinite money hack of AI won't save them.

lukan 2 days ago | parent [-]

Are you referring to the standard contract/terms of use, or does your company have a special contract with them?

ben_w 3 days ago | parent | prev [-]

> Besides lots of GPUs, training data seems like the most valuable asset AI companies have. Sounds like a strong incentive to me to secretly use it anyway. Who would really know, if the pipelines are set up in a way that only very few people are aware of this?

Could be, but it's a huge risk the moment any lawsuit happens and the "discovery" process starts. Or whistleblowers.

They may well take that risk, they're clearly risk-takers. But it is a risk.

yunwal 3 days ago | parent | next [-]

Eh, they're all using copyrighted training data from torrent sites anyway. If the government was gonna hold them accountable for this, it would have happened already.

ragequittah 3 days ago | parent | next [-]

You're probably right [1]

[1] https://www.cbc.ca/news/business/anthropic-ai-copyright-sett...

ben_w 3 days ago | parent | prev [-]

The piracy was found to be unlawful copyright infringement.

The training was OK, but the piracy wasn't; they were held accountable for that.

blibble 3 days ago | parent | prev [-]

the US no longer has any form of rule of law

so there's no risk

ben_w 3 days ago | parent | next [-]

The USA is a mess that's rapidly getting worse, but it has not yet fallen that far.

Aurornis 3 days ago | parent | prev [-]

> the US no longer has any form of rule of law

AI threads really bring out the extreme hyperbole and doomerism.

biammer 3 days ago | parent | prev [-]

I understand how these LLMs work.

I find it hard to believe there are people who know these companies stole the entire creative output of humanity and egregiously, continually scrape the internet, yet still think they are, for some reason, ignoring the data you voluntarily give them.

> I know that breaks the "training the tools of the oppressor" narrative

"Narrative"? This is just reality. In their own words:

> The awards to Anthropic, Google, OpenAI, and xAI – each with a $200M ceiling – will enable the Department to leverage the technology and talent of U.S. frontier AI companies to develop agentic AI workflows across a variety of mission areas. Establishing these partnerships will broaden DoD use of and experience in frontier AI capabilities and increase the ability of these companies to understand and address critical national security needs with the most advanced AI capabilities U.S. industry has to offer. The adoption of AI is transforming the Department’s ability to support our warfighters and maintain strategic advantage over our adversaries [0]

Is 'warfighting adversaries' some convoluted code for allowing Aurornis to 'see a 1337x in productivity'?

Or perhaps you are a wealthy westerner of a racial and sexual majority and as such have felt little by way of oppression by this tech?

In such a case I would encourage you to develop empathy, or at least sympathy.

> Using an LLM for inference ... does not train the LLM.

In their own words:

> One of the most useful and promising features of AI models is that they can improve over time. We continuously improve our models through research breakthroughs as well as exposure to real-world problems and data. When you share your content with us, it helps our models become more accurate and better at solving your specific problems and it also helps improve their general capabilities and safety. We do not use your content to market our services or create advertising profiles of you—we use it to make our models more helpful. ChatGPT, for instance, improves by further training on the conversations people have with it, unless you opt out.

[0] https://www.ai.mil/latest/news-press/pr-view/article/4242822...

[1] https://help.openai.com/en/articles/5722486-how-your-data-is...

ben_w 3 days ago | parent [-]

> Is 'warfighting adversaries' some convoluted code for allowing Aurornis to 'see a 1337x in productivity'?

Much as I despair at the current developments in the USA, and I say this as a sexual minority and a European, this is not "tools of the oppressor" in their own words.

Trump is extremely blunt about who he wants to oppress. So is Musk.

"Support our warfighters and maintain strategic advantage over our adversaries" is not blunt, it is the minimum baseline for any nation with assets anyone else might want to annex, which is basically anywhere except Nauru, North Sentinel Island, and Bir Tawil.

biammer 3 days ago | parent [-]

> "Support our warfighters and maintain strategic advantage over our adversaries" is not blunt, it is the minimum baseline for any nation with assets anyone else might want to annex

I think it's gross to distill military violence as defending 'assets [others] might want to annex'.

What US assets were being annexed when US AI was used to target Gazans?

https://apnews.com/article/israel-palestinians-ai-technology...

> Trump is extremely blunt about who he wants to oppress. So is Musk.

> our adversaries" is not blunt

These two thoughts seem to be in conflict.

What 'assets' were being protected from annexation here by this oppressive use of the tool? The chips?

https://www.aclu.org/news/privacy-technology/doritos-or-gun

ben_w 3 days ago | parent | next [-]

> I think it's gross to distill military violence as defending 'assets [others] might want to annex'.

Yes, but that's how the world works:

If another country wants a bit of your country for some reason, they can take it by force unless you can make at the very least a credible threat against them, and sometimes a lot more than that.

Note that this does not rule out there being an aggressor somewhere. I'm not denying the existence of aggressors, nor the capacity of the USA to be an aggressor. All I'm saying is that your quotation is so vague as to also encompass those who are not.

> What US assets were being annexed when US AI was used to target Gazans?

First, I'm saying the statement is so broad as to encompass other things besides being a warmonger. Consider the opposite statement: "don't support our warfighters and don't maintain strategic advantage over our adversaries" would be absolutely insane; therefore "support our warfighters and maintain strategic advantage over our adversaries" says nothing.

Second, in this case the country doing the targeting is… Israel. To the extent that the USA cares at all, it's to get votes from the large number of Jewish people living in the USA. Similar deal with how it treats Cuba since the fall of the USSR: it's about votes (from Cuban exiles in that case, but still, votes).

Much as I agree that the conduct of Israel with regard to Gaza was disproportionate, went beyond what was necessary, and was likely so bad as to even damage Israel's long-term strategic security, if you imagine the people of Israel actually deciding "don't support our warfighters and don't maintain strategic advantage over our adversaries", they would quickly get victimised much harder than those they were victimising. That's the point: the quote you cite as evidence is so broad that everyone has approximately that, because not having it means facing one's own destruction.

There's a misattributed quote, "People sleep peaceably in their beds at night because rough men stand ready to do violence on their behalf"; that's where this is at.

> These two thoughts seem at conflict.

Musk openly and directly says "Canada is not a real country", says "cis" is hate speech, responded to the pandemic by tweeting "My pronouns are Prosecute/Fauci", and his self-justification for his trillion-dollar bonus for hitting future targets is wanting to be in control of what he describes as a "robot army". Trump openly and explicitly wants the USA to annex Canada, Greenland, and the Panama Canal, is throwing the National Guard around, openly calls critics traitors, and calls for the death penalty. They're as subtle as exploding volcanoes; nobody needs to take the worst-case interpretation of what they're saying to notice this.

Saying "support our warfighters" is something done by basically every nation everywhere all the time, because those places that don't do this quickly get taken over by nearby nations who sense weakness. Which is kinda how the USA got Texas, because again, I'm not saying the USA is harmless, I'm saying the quote doesn't show that.

> What 'assets' were being protected from annexation here by this oppressive use of the tool? The chips?

This would have been a much better example to lead with than the military stuff.

I'm absolutely all on board with the general consensus that the US police are bastards in this specific way, and have been since that kid got shot for having a toy gun in an open-carry state. (I am originally from a country where even the police are not routinely armed, and I do not value the 2nd Amendment, but if you're going to say "we allow open carry of firearms" you absolutely do not get to use "we saw someone carrying a firearm" as an excuse to shoot them.)

However: using LLMs to code doesn't seem likely to make a difference either way for this. If I were writing a gun-detection AI (perhaps I'm out of date), I'd use a simpler model that runs locally on-device and doesn't do anything else besides the sales pitch.
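
To be concrete about the kind of "simpler model that runs locally" I mean: an off-the-shelf object detector, roughly along these lines. This is only a sketch using a stock torchvision COCO model as a stand-in (COCO has no "gun" class, the file name is hypothetical, and this is not anyone's actual pipeline):

    import torch
    from torchvision.io import read_image
    from torchvision.models.detection import (
        fasterrcnn_resnet50_fpn,
        FasterRCNN_ResNet50_FPN_Weights,
    )

    # Load a pretrained detector that runs entirely on the local machine.
    weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    model = fasterrcnn_resnet50_fpn(weights=weights).eval()

    img = read_image("frame.jpg")            # hypothetical camera frame
    batch = [weights.transforms()(img)]      # preprocess to what the model expects

    with torch.no_grad():
        detections = model(batch)[0]         # dict of boxes, labels, scores

    categories = weights.meta["categories"]
    for label, score in zip(detections["labels"], detections["scores"]):
        if score > 0.8:                      # arbitrary confidence threshold
            print(categories[int(label)], float(score))

Nothing in that loop needs a hosted LLM, a network connection, or anyone else's data.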
