est 20 hours ago

> I'm going to feed all of my business's data to it

Your business data is probably worthless, or even harmful, for the pretraining corpus.

Your interactions and decision-making process are the most valuable parts of the whole business.

bandrami 19 hours ago | parent | next [-]

I assure you my business's data is not remotely worthless, which is why there are pretty strict laws and regulations about what we can do with it.

TZubiri 18 hours ago | parent | prev [-]

>Your business data is probably worthless

Please tell me you are not in charge of the data of any business I'm a client of.

elpocko 17 hours ago | parent | next [-]

Could be! Let's check. I just need your name and address, your SSN, a list of businesses you are a client of, and a DNA sample.

est 18 hours ago | parent | prev [-]

To clarify: probably worthless to AI vendors, but it might be useful to third parties.

TZubiri 18 hours ago | parent [-]

Third parties that can be clients of the AI vendor...

selcuka 16 hours ago | parent [-]

If it's worthless to AI vendors, they won't include it in the training corpus, so third parties won't have access to it.

estearum 14 hours ago | parent | next [-]

They're alluding to something more like espionage, or just selling the interesting stuff you put in the text box.

TZubiri 7 hours ago | parent [-]

Wow, I thought this was quite obvious. Apparently not, so I'll explain.

An LLM provider sells usage of its model. You use it to write code. Other clients use it to write code as well. If the LLM provider trains on user data, then your usage benefits other users. If you pay the company to generate code, then by definition the output is useful, and it is highly likely that other customers care about it too.

Replace writing code with anything: a lawyer, a psychologist, a confessional. The input and output are inherently useful to users in the same category.

That is to say nothing of adversarial use, where the data is valuable because a counterparty might find it useful: an attacker might find common code patterns, a lawyer might see what the opposition is being advised, a boy might see what a girl asks or gets advised, etc.

If this sounds too complex to you, just think of training on data as exfiltration with added steps, because that's what it is.

estearum 7 hours ago | parent [-]

Oh, well, this is a bad argument. I made a mistake by assuming you had made a good argument instead.

bandrami 13 hours ago | parent | prev | next [-]

The worry is direct exfiltration, not training.

TZubiri 7 hours ago | parent | prev [-]

But it isn't worthless, because the user is paying for that, and third parties are paying for that as well. Unless your input and output are completely different from everyone else's, which they're not, because you are human, and I bet you have a profession that other humans have, and many other qualities that you share with other humans.

In any case, relying on the chance that the LLM provider won't train on your data because of its presumably low value is as good a strategy as crossing your fingers or venerating the god of rain. You should be relying on contractual clauses, at least when professional and client data are involved.