Remix.run Logo
bandrami 21 hours ago

For the life of me I will never understand the thought process that leads you to say "we don't really know who developed this LLM but I'm going to feed all of my business's data to it"

WithinReason 17 hours ago | parent | next [-]

It's from Tencent, says it in the article:

https://hy.tencent.com/research/hy3

bandrami 16 hours ago | parent [-]

Right but Tencent is a massive half-state-controlled holding company so that's not really helpful.

minraws 16 hours ago | parent | next [-]

OpenAI & Anthropic are deeply in bed with US govt, and they need US govt approval before model releases, and all US Companies under various acts need to share data with the govt.

I mean sure there are investors and a little more open-ness, but with the example of Mythos we don't even know if public will get access to the "good" stuff because it's too dangerous.

If your only opinion on trusting these companies more than one based in China is, they are Chinese then good luck, all the best.

estearum 14 hours ago | parent | next [-]

The difference is "the various acts" in the US are things that are largely very hard to do, extremely limited in scope, and companies who dispute the government's propriety can (and do) go to court to fight it.

Sure "China bad, US good" is naive, but certainly not more naive than suggesting that companies and individuals have similar rights and protections as each other.

> and they need US govt approval before model releases

This is just not true and it would be a gigantic legal battle to make it true against the model companies' wishes, which is indicative of your entire misunderstanding here.

adrian_b 13 hours ago | parent [-]

There was recently some announcement from the US govt itself (after the Mythos announcement) that they were pondering about allowing model releases from now on only after approving them.

So it may not be strictly true for the moment, but it is certainly something that the current US govt can mandate at any time.

estearum 12 hours ago | parent | next [-]

The US government just saying they were pondering something is:

1) Far from them actually trying to do it

2) Very, very far from them actually doing it successfully

The US government absolutely cannot "just tell" private entities what products they're allowed to create and sell, and the fact that LLMs are arguably a form of expression will make these particular products extremely hard to regulate – especially as a broad "government checkpoint" on incremental product updates.

In China, it really is as simple as the government deciding that it doesn't like your products and ta-da, you can no longer sell them.

It's beyond naive to act like these are similar in any meaningful sense.

Danox 9 hours ago | parent | prev [-]

Nonsense, the genie is out of the bottle worldwide and it isn’t going back in, and due to the activity of the current US government America’s standing, is declining most countries going into the future are going to hedge against the United States and whatever it says the good old days (goodwill/the small benefit of the doubt) are gone.

The AI oligarchs have no loyalty and when it comes to making money and they will drop the king at their first opportunity and the king in return will do the same.

bandrami 15 hours ago | parent | prev | next [-]

Well, I mean, just as a legal question I'm not allowed to use Chinese software at work, so yeah that's kind of definitive for me

nl 15 hours ago | parent | prev [-]

> and they need US govt approval before model releases

This isn't the case (yet).

irthomasthomas 14 hours ago | parent [-]

It is for models trained with 10^26 flops. Anthropic confirmed Mythos was less than this. You could estimate the upper bound on model size from this.

nl 12 hours ago | parent [-]

That's the Biden executive order. It's notify only - the company must tell the government but the government doesn't approve or allow the release.

irthomasthomas 11 hours ago | parent [-]

Ah yeah that sounds right.

throawayonthe 13 hours ago | parent | prev [-]

but we know who they are? how is this relevant

est 20 hours ago | parent | prev | next [-]

> I'm going to feed all of my business's data to it

Your business data is probably worthless, even considered harmful for the pretrain corpus.

Your interactions and decision making process are most valuable parts of the whole business.

bandrami 19 hours ago | parent | next [-]

I assure you my business's data is not remotely worthless which is why there are pretty strict laws and regulations about what we can do with it

TZubiri 18 hours ago | parent | prev [-]

>Your business data is probably worthless

please tell me you are not in charge of the data of any business I'm a client of

elpocko 16 hours ago | parent | next [-]

Could be! Let's check. I just need your name and address, your SSN, a list of businesses you are a client of, and a DNA sample.

est 18 hours ago | parent | prev [-]

to clarify, probably worthless to AI vendors, but might be useful for third-parties.

TZubiri 18 hours ago | parent [-]

Third parties that can be clients of the AI vendor...

selcuka 15 hours ago | parent [-]

If it's worthless to AI vendors, they won't include it in the training corpus, so third parties won't have access to it.

estearum 14 hours ago | parent | next [-]

They're alluding to something more like espionage of just selling the interesting stuff you put in the text box.

TZubiri 7 hours ago | parent [-]

Wow I thought this was quite obvious, apparently not, so I'll explain.

Llm provider sells usage of their model. You use it to write code. Other clients use it to write code as well. If the llm provider trains with user data, then the usage benefits other users. If you pay the company to generate code,then by definition it is useful, and highly likely that other customers care about it.

Replace writing code with anything, a lawyer, a psychologist, a confessional. The IO is inherently useful to users of the same category.

That is to say nothing of adversarial use, that is, being useful because a counterparty might find it useful, so an attacker might find common code patterns, a lawyer might see what the opposition might be advised, a boy might see what a girl asks or gets advised, etc..

If this sounds too complex to you, just think of training on data as exfiltration with added steps, because that's what it is

estearum 7 hours ago | parent [-]

Oh well this is a bad argument. I made a mistake by assuming you made a good argument instead.

bandrami 13 hours ago | parent | prev | next [-]

The worry is direct exfiltration, not training

TZubiri 7 hours ago | parent | prev [-]

But it isn't worthless because the user is paying for that, and third parties are paying for that as well. Unless the input output is completely different, which it's not because you are human, and I bet you have a profession which other humans have, and many other qualities which you share with other humans.

In any case, relying on the chance that the LLM inference won't train on your data because of it's presumably low value is as good a strategy as crossing your fingers or venerating the god of rain. You should be relying on contractual clauses at least when including professional and client data.

kirtivr 18 hours ago | parent | prev | next [-]

You don't need to know who developed the LLM - whether it was Google or OpenAI.

What you need to know is who is the provider for the LLM, and whether their endpoints are zero data retention enabled and opted out of training. OpenRouter gives you an easy way to control this.

lmf4lol 18 hours ago | parent | next [-]

This is not entirely true and ignoring a couple of potential attack vectors like Data Poisoning: https://arxiv.org/abs/2408.12798

Its of course highly dependant on the use case and the environment, but simply saying that the only important part is to know where the data goes is too simple.

koiueo 18 hours ago | parent | prev [-]

How can openrouter control what LLM provider does with your data on their side?

kirtivr 18 hours ago | parent [-]

OpenRouter and the provider sign a contract clearly specifying how input data is to be handled.

It's the same way we trust OpenAI to not train on our data if we've opted out although there is no control on whether they can retain the data indefinitely.

lmf4lol 18 hours ago | parent | next [-]

I really dont want to be cynic but those guys gave a flying f””” about copyright while scraping the whole internet. How can I ever trust them to respect the oot-out setting. I cant. Thieves be thieves.

And even if they dont train on the data. Who guarantees us, they dont let another AI model analyse all the data, exfiltrating all kinds of intelligence and using it? I only can imagine what OpenAI and Anthropic know….

astrange 17 hours ago | parent [-]

Scraping the internet isn't a copyright violation. Using it for LLM training is much more transformative than Google and Internet Archive, which are legal.

jazzyjackson 11 hours ago | parent | next [-]

Your right, scraping is legally protected. It's reproducing verbatim text that's a violation, which is why LLMs still clumsily refuse to produce song lyrics. They are capable of copyright violations and have to be 'aligned' not to get their providers sued.

estearum 2 hours ago | parent [-]

Verbatim reproduction is neither necessary nor sufficient to create a copyright violation.

"Copyright violation" is what we call the set of things that destroy the incentive for people to create original work by unduly benefitting from someone else's original work.

alfiedotwtf 17 hours ago | parent | prev [-]

To be honest, this is the first time someone has spelt it out in a nicely succinct paragraph.

And just like that, I totally agree with you

estearum 14 hours ago | parent [-]

Except it ignores the entire premise of copyright which is to protect incentives to create original work, which Google does not destroy and which LLMs (very loudly and proudly) try to do.

There are several components of the Fair Use test, "transformation" is just one of them. The most important dimension is the effect on the market, i.e. the effect on incentives.

You probably shouldn't base your legal analysis on pithy internet comments regardless of how succinct or agreeable they are to you.

koiueo 17 hours ago | parent | prev [-]

Contracts means shit if they are not enforceable.

Ask yourself

1. How would you know the provider has violated the contract?

2. How could you prove it?

3. Why would OpenRouter take your side in this (unlike your example with OpenAI, you're not a signing party)?

4. How would OpenRouter enforce the contract after all three above are somehow resolved in your favor?

IANAL, but IMO it's all a legal theater.

EDIT: formatting

ddalex 20 hours ago | parent | prev | next [-]

what can it do ? it's just a big set of numbers, if you trust the host that's good enough

what266262 20 hours ago | parent [-]

If you are ok with everything being fed into it being stored forever I guess it’s no problem. I don’t see how you trust them if you don’t know them.

Dylan16807 20 hours ago | parent | next [-]

Who is "them" here? The developers and the hosts are not the same.

bandrami 19 hours ago | parent [-]

(And either one is a threat vector)

ddalex 14 hours ago | parent | prev [-]

where would it be stored ? it's just a big set of numbers.

Mashimo 20 hours ago | parent | prev | next [-]

If you Code open source projects anyway, might give it a spin.

st3fan 13 hours ago | parent | prev [-]

How do you “feed data into a model” ? Use the correct terminology and concepts please. It is important.