| ▲ | cc62cf4a4f20 9 hours ago |
| It's really quite amazing that people would actually hook an AI company up to data that actually matters. I mean, we all know that they're only doing this to build a training data set to put your business out of business and capture all the value for themselves, right? |
|
| ▲ | simonw 9 hours ago | parent | next [-] |
A few months ago I would have said that no, Anthropic make it very clear that they don't ever train on customer data - they even boasted about that in the Claude 3.5 Sonnet release back in 2024: https://www.anthropic.com/news/claude-3-5-sonnet

> One of the core constitutional principles that guides our AI model development is privacy. We do not train our generative models on user-submitted data unless a user gives us explicit permission to do so.

But they changed their policy a few months ago, so as of October they are much more likely to train on your inputs unless you've explicitly opted out: https://www.anthropic.com/news/updates-to-our-consumer-terms

This sucks so much. Claude Code started nagging me for permission to train on my input the other day, and I said "no", but now I'm always going to be paranoid that I'll miss some opt-out somewhere and they'll start training on my input anyway.

And maybe that doesn't matter at all? But no AI lab has ever given me a convincing answer to the question "if I discuss company-private strategy with your bot in January, how can you guarantee that a newly trained model that comes out in June won't answer questions about it to anyone who asks?" I don't think that would happen, but I can't in good faith say to anyone else "that's not going to happen".

For any AI lab employees reading this: we need clarity! We need to know exactly what it means to "improve your products with your data" or whatever vague weasel-words the lawyers made you put in the terms of service. |
| |
▲ | usefulposter 8 hours ago | parent | next [-] | | This would make a great blogpost.

> I'm always going to be paranoid that I miss some opt-out somewhere

FYI, Anthropic's recent policy change used some insidious dark patterns to opt existing Claude Code users in to data sharing. https://news.ycombinator.com/item?id=46553429

> whatever vague weasel-words the lawyers made you put in the terms of service

At any large firm, product and legal work in concert to achieve the goal (training data); they know what they can get away with. | | |
▲ | simonw 8 hours ago | parent [-] | | I often suspect that the goal isn't exclusively training data so much as the freedom to do things they haven't thought of yet.

Imagine you come up with non-vague consumer terms for your product that perfectly match your current needs as a business. Everyone agrees to them and is happy. And then OpenAI discover some new training technique which shows incredible results but relies on a tiny sliver of unimportant data that you've just cut yourself off from!

So I get why companies want terms that sound friendly but keep their options open for future unanticipated needs. It's sensible from a business perspective, but it sucks for someone who is frequently asked questions about how safe it is to sign up as a customer of these companies, because I can't provide credible answers. |
| |
| ▲ | brushfoot 9 hours ago | parent | prev | next [-] | | To me this is the biggest threat that AI companies pose at the moment. As everyone rushes to them for fear of falling behind, they're forking over their secrets. And these users are essentially depending on -- what? The AI companies' goodwill? The government's ability to regulate and audit them so they don't steal and repackage those secrets? Fifty years ago, I might've shared that faith unwaveringly. Today, I have my doubts. | |
▲ | hephaes7us 6 hours ago | parent | prev | next [-] | | Why do you even necessarily think that wouldn't happen? As I understand it, we'd essentially be relying on something like an mp3 compression algorithm failing to capture a particular, subtle transient -- the lossy nature itself is the only real protection. I agree that it's vanishingly unlikely if one person includes a sensitive document in their context, but what if a company has a project context which includes the same document in 10,000 chats? Maybe then it's much more likely that whatever private memo it contains gets captured in training... | | |
| ▲ | simonw 6 hours ago | parent [-] | | I did get an answer from a senior executive at one AI lab who called this the "regurgitation problem" and said that they pay very close attention to it, to the point that they won't ship model improvements if they are demonstrated to cause this. | | |
| ▲ | nprateem 6 hours ago | parent [-] | | Lol and that was enough for you? You really think they can test every single prompt before release to see if it regurgitates stuff? Did this exec work in sales too :-D | | |
▲ | TeMPOraL 3 hours ago | parent | next [-] | | They have a clear incentive to do exactly what they said - regurgitation is a problem because it indicates the model failed to learn from the data and merely memorized it. |
| ▲ | simonw 4 hours ago | parent | prev [-] | | I think they can run benchmarks to see how likely it is for prompts to return exact copies of their training data and use those benchmarks to help tune their training procedures. |
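A minimal sketch of the kind of check that implies (illustrative only - the n-gram overlap approach and every name below are assumptions, not anything a lab has published about its actual process):

    # Purely illustrative sketch of a naive regurgitation check: flag any model
    # completion that shares a long word n-gram with a known sensitive document.
    # The n-gram approach and all names here are assumptions, not a lab's method.

    def ngrams(text, n=8):
        """Return the set of n-word windows in a piece of text."""
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def regurgitation_rate(completions, training_docs, n=8):
        """Fraction of completions that reproduce any n-gram from the documents."""
        doc_grams = set()
        for doc in training_docs:
            doc_grams |= ngrams(doc, n)
        hits = sum(1 for c in completions if ngrams(c, n) & doc_grams)
        return hits / len(completions) if completions else 0.0

    # Hypothetical usage: sample completions from the candidate model over a fixed
    # prompt set, then gate the release on the measured rate staying near zero.
    # completions = [model.generate(p) for p in benchmark_prompts]  # hypothetical API
    # assert regurgitation_rate(completions, sensitive_docs) < 0.001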
|
|
| |
▲ | postalcoder 8 hours ago | parent | prev [-] | | I despise the thumbs up and thumbs down buttons for one reason: “whoops, I accidentally pressed this button and cannot undo it - looks like I just opted into my code being used for training data, retained for life, with their employees reading everything.” |
|
|
| ▲ | TeMPOraL 3 hours ago | parent | prev | next [-] |
> I mean, we all know that they're only doing this to build a training data set

That's not a problem. It leads to better models.

> to put your business out of business and capture all the value for themselves, right?

That's both true and paranoid. Yes, LLMs will subsume most of the software industry, and many things downstream of it. There's little anyone can do about it; this is what happens when someone invents a brain on a chip.

But no, LLM vendors aren't gunning for your business. They neither care, nor would they have the capability to follow through if they did. In fact, my prediction is that LLM vendors will refrain from cannibalizing distinct businesses for as long as they can - because as long as they just offer API services (broad as they may be), they can charge rent from an increasingly large share of the software industry. It's a goose that lays golden eggs - it makes sense to keep it alive for as long as possible. |
|
| ▲ | falloutx 9 hours ago | parent | prev | next [-] |
It's impossible to explain this to business owners: giving a company this much access can't end well. Right now Google, Slack, and Apple each hold a share of the data, but with this, Claude can get all of it. |
| |
| ▲ | simonw 9 hours ago | parent | next [-] | | Is there a business owner alive who doesn't worry about AI companies "training on their data" at this point? They may still decide to use the tools, but I'd be shocked if it isn't something they are thinking about. | |
| ▲ | cc62cf4a4f20 9 hours ago | parent | prev [-] | | We've seen this playbook with social media - be nice and friendly until they let you get close enough to stick the knife in. | | |
▲ | TeMPOraL 3 hours ago | parent [-] | | Doesn't matter to 99.99% of businesses using social media. Only to the silly ones who decided to use a platform to compete with the platform itself, and to the ones that made a platform their critical dependency without realizing they were making a bet, and were then surprised when it didn't pan out. |
|
|
|
| ▲ | bearjaws 4 hours ago | parent | prev [-] |
This is the AI-era equivalent of "I can't share my ideas because you will steal them." The reality is that good ideas and a few SOPs do not make a successful business. |