You don't need to know who developed the LLM - whether it was Google or OpenAI.

What you need to know is who is the provider for the LLM, and whether their endpoints are zero data retention enabled and opted out of training. OpenRouter gives you an easy way to control this.

▲

lmf4lol 18 hours ago | parent | next [-]

This is not entirely true and ignoring a couple of potential attack vectors like Data Poisoning: https://arxiv.org/abs/2408.12798

Its of course highly dependant on the use case and the environment, but simply saying that the only important part is to know where the data goes is too simple.

▲

koiueo 18 hours ago | parent | prev [-]

How can openrouter control what LLM provider does with your data on their side?

▲

kirtivr 18 hours ago | parent [-]

OpenRouter and the provider sign a contract clearly specifying how input data is to be handled.

It's the same way we trust OpenAI to not train on our data if we've opted out although there is no control on whether they can retain the data indefinitely.

▲

lmf4lol 18 hours ago | parent | next [-]

I really dont want to be cynic but those guys gave a flying f””” about copyright while scraping the whole internet. How can I ever trust them to respect the oot-out setting. I cant. Thieves be thieves.

And even if they dont train on the data. Who guarantees us, they dont let another AI model analyse all the data, exfiltrating all kinds of intelligence and using it? I only can imagine what OpenAI and Anthropic know….

▲

astrange 17 hours ago | parent [-]

Scraping the internet isn't a copyright violation. Using it for LLM training is much more transformative than Google and Internet Archive, which are legal.

▲

jazzyjackson 11 hours ago | parent | next [-]

Your right, scraping is legally protected. It's reproducing verbatim text that's a violation, which is why LLMs still clumsily refuse to produce song lyrics. They are capable of copyright violations and have to be 'aligned' not to get their providers sued.

	▲	estearum 2 hours ago \| parent [-]
		Verbatim reproduction is neither necessary nor sufficient to create a copyright violation. "Copyright violation" is what we call the set of things that destroy the incentive for people to create original work by unduly benefitting from someone else's original work.

▲

alfiedotwtf 17 hours ago | parent | prev [-]

To be honest, this is the first time someone has spelt it out in a nicely succinct paragraph.

And just like that, I totally agree with you

	▲	estearum 14 hours ago \| parent [-]
		Except it ignores the entire premise of copyright which is to protect incentives to create original work, which Google does not destroy and which LLMs (very loudly and proudly) try to do. There are several components of the Fair Use test, "transformation" is just one of them. The most important dimension is the effect on the market, i.e. the effect on incentives. You probably shouldn't base your legal analysis on pithy internet comments regardless of how succinct or agreeable they are to you.

▲

koiueo 17 hours ago | parent | prev [-]

Contracts means shit if they are not enforceable.

Ask yourself

1. How would you know the provider has violated the contract?

2. How could you prove it?

3. Why would OpenRouter take your side in this (unlike your example with OpenAI, you're not a signing party)?

4. How would OpenRouter enforce the contract after all three above are somehow resolved in your favor?

IANAL, but IMO it's all a legal theater.

EDIT: formatting