deaux 7 hours ago

I'm surprised that people here don't seem to care at all about these models openly training on your data, especially if you use them straight from the model developer. Whereas things like "GitHub now automatically opts everyone into using their code for model training" get hundreds of justifiably angry comments, I never see this brought up anymore on posts like these about using Chinese models through OpenRouter. This might be explained by "well, they're different people", but the difference is too stark for that to be the whole explanation.

dbeley 5 hours ago | parent | next [-]

The cool thing about open-weights models is that you're free to use alternative providers that won't phone home to the original model creators.

I see 6 alternative providers listed on Openrouter for DeepSeek V4 Pro for example.
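Concretely, picking an alternative host on OpenRouter comes down to setting its `provider` routing field in the request body. A minimal sketch, assuming OpenRouter's OpenAI-compatible chat schema; the model slug and host names below are illustrative assumptions, not a real listing:

```python
import json

# Sketch of an OpenRouter request body that pins specific third-party hosts,
# so the request is never routed to the original model creator's servers.
# Model slug and provider names are assumed for illustration.
def build_request(prompt: str) -> dict:
    return {
        "model": "deepseek/deepseek-chat",           # open-weights model slug (assumed)
        "messages": [{"role": "user", "content": prompt}],
        "provider": {
            "order": ["DeepInfra", "Together"],      # preferred hosts, tried in order (assumed names)
            "allow_fallbacks": False,                # fail rather than fall back to other hosts
        },
    }

body = build_request("hello")
print(json.dumps(body["provider"], sort_keys=True))
```

Sending it is then an ordinary HTTPS POST with an API key; whether the listed host really avoids logging is, of course, still a matter of trust.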

eckelhesten 3 hours ago | parent [-]

At least that's what they're telling you. It's a "trust me bro" scenario.

I'd rather use the phone-home version (DeepSeek's own endpoint). The benefit is that I'm fairly certain they're actually hosting the model I'm paying for.

pheggs 7 hours ago | parent | prev | next [-]

I'm personally okay with helping them as long as they publish the models and don't keep them closed. And I don't trust the settings where providers say they won't train on your data.

never_inline an hour ago | parent | prev | next [-]

I'm fine with them training on my open source code (which is pretty bad, but that's not the point) because they're providing the service for free. I'd be super pissed if I paid for enterprise and they trained on it, though. I believe this is the opinion of the majority of programmers.

gmerc 6 hours ago | parent | prev | next [-]

Because they give the models away for free and offer APIs at very reasonable rates. Not that hard to figure out; Robin Hood stealing back our data tax comes to mind.

deaux 5 hours ago | parent [-]

GitHub is free.

notrealyme123 5 hours ago | parent [-]

User publishes to GitHub => Copilot trains on GitHub data => MS sells Copilot => user works for Microsoft (in the sense of giving their labour for MS to make money).

User publishes to GitHub => DeepSeek trains on GitHub data => DeepSeek gives the model away for free => user did not work for DeepSeek (in the sense of giving their labour for DeepSeek to make money).

deaux 3 hours ago | parent | next [-]

In the first case, MS is giving part of GitHub itself away for free.

arikrahman 5 hours ago | parent | prev [-]

Exactly, it's intuitively different.

vagrantJin 3 hours ago | parent | prev | next [-]

You definitely have a bone to pick. Chinese researchers have consistently given the world cheap, high-quality research around LLMs. They don't pretend; they do the work and release the goodies, mostly so cheaply that everyone in the world has a chance to use close-to-frontier models. Why would you respond with "anger"?

Let us know what your real complaint is, and let's not feign indignation at open models and research.

deaux 3 hours ago | parent [-]

You're making completely unfounded assumptions about me. I use Chinese models myself.

vagrantJin a few seconds ago | parent [-]

I made no such claims. Maybe you have something to share about why we need to have a negative view of free and open models based on publicly available frontier research.

prism56 6 hours ago | parent | prev | next [-]

If the data is open source on GitHub, then in my opinion it should be fair game.

ozgrakkurt 6 hours ago | parent | next [-]

IMO this is unfair to GPL or similarly licensed code.

Seems OK for MIT-style licensed code, though.

ForHackernews 5 hours ago | parent | next [-]

It's totally fair to use GPL code, it just means all the models built by Anthropic, OpenAI, etc. using GPL-licensed source are themselves bound by the GPL. Plus, any works created downstream using those AI tools.

We're on the verge of a golden age of software as soon as someone finds a court with courage.

duskdozer 4 hours ago | parent [-]

Ah, you have much more faith in the legal system than I do. It's nice to dream, though.

edg5000 4 hours ago | parent | prev [-]

I think AI will create an open source dark age. Gradually, we'll see a lot less good new open source code, and a gradual shift back to the proprietary world, similar to the 1950–1990 period.

notrealyme123 5 hours ago | parent | prev [-]

Things being public shouldn't be enough. Just because someone leaked your medical information to the public via a data breach doesn't make it fair game. There should be some rules.

prism56 5 hours ago | parent | next [-]

I feel that's a false dichotomy. The code on github is freely available for people to read and learn from, leaked medical data isn't.

duskdozer 5 hours ago | parent | prev | next [-]

What do you mean specifically? Data passed through OpenRouter? Or that they too indiscriminately ingest data all over the web? If the former, I assume it's just that anyone still using them just doesn't care where the data comes from. If the latter, well, it seems like every day there's some news on some new model from somewhere, and it takes dedication to complain every time. There's also the factor that I believe DeepSeek is more open with the model, while others keep it entirely proprietary, which feels fairer and (personally) is also less offensive.

antiloper 6 hours ago | parent | prev | next [-]

AWS Bedrock has DeepSeek models running on their infrastructure. That should be enough to prevent training on user data (there's a markup compared to DeepSeek's pricing though).

And unfortunately AWS doesn't have prepaid billing, so you can't just give the internet access to your API key without getting FinDDoS'd.

ThreatSystems 4 hours ago | parent | next [-]

If anyone is looking for a solution in this space, fire me an email; I have a partner who's focused closely on that problem set!

deaux 6 hours ago | parent | prev [-]

The latest one available for serverless inference looks to be from 8 months ago (DeepSeek V3.1), which is an eternity and far behind.

edg5000 4 hours ago | parent | prev | next [-]

My policy is that I don't allow agents to access all code; some of it is shielded behind bind mounts. Maybe this is a pathetic, artisanal (or ego-driven) reaction of mine to the inevitable. I allow them to work on about 90% of the code (most codebases fully), with some code being considered too valuable to expose to the vendor. When data is involved, LLMs only get to see anonymized data.
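The anonymization step could be as simple as a redaction pass over any text before it reaches the model. A minimal sketch; the two patterns and placeholder tokens here are my own assumptions, and real PII scrubbing would need a far broader rule set:

```python
import re

# Minimal redaction pass run before any data is sent to an LLM.
# These two patterns (emails and phone-like numbers) are illustrative only.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]

def anonymize(text: str) -> str:
    # Apply each pattern in order, replacing matches with a placeholder.
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(anonymize("Reach Jane at jane.doe@example.com or +1 (555) 010-9999."))
```

For anything beyond a toy policy, a dedicated scrubbing library or a named-entity pass would be the safer choice; regexes miss names, addresses, and anything context-dependent.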

This cute policy of mine won't affect anything, though. The more we use the models, the more the models will replace this kind of work. Centralisation of power is inevitable: in medieval Europe, state and church ruled; in modern times, but before the internet, it was probably state and banks. Maybe with ongoing digitization (bank offices disappearing) making banks cheaper to operate, combined with bank bailouts, governments will fully nationalize banks, or at least banks will consolidate.

Then the AI companies will consolidate with the internet information and communication companies (Google/Meta for the US, Alibaba/Tencent for China). Maybe we'll end up with a few de facto governmental megacorps that rule in tandem and in close cooperation with the formal government, which might handle mostly infrastructure, utilities, and the army. The megacorps would control the narrative more and take on more of a paternal role (educating and protecting citizens, normally handled by formal governments).

Does this make sense?

eckelhesten 3 hours ago | parent | prev | next [-]

As opposed to?

Do you really think OpenAI, Anthropic or any other entity in the same business respects your data?

The Chinese AI companies who release open weights actually deserve whatever input you give them. They are the reason why there is competition and not duopolies in the domain.

deaux 3 hours ago | parent [-]

I think Google, and likely Anthropic, indeed do honor the settings chosen by the user. For Google in particular it'd be very surprising if they didn't. That's also why both do everything they can to trick users into allowing it.

OpenAI, I wouldn't be surprised if you were right.

pheggs 2 hours ago | parent | next [-]

Unfortunately, the history of these big tech companies has shown that they don't care about data privacy and are even willing to lie about it. But I guess it's irrelevant; in practice you have to assume the worst anyway, since there's no way to verify it.

eckelhesten an hour ago | parent | prev [-]

The models don't get better by themselves. You're naive.

raincole 5 hours ago | parent | prev | next [-]

Two factors. First is anti-americanism (or at least anti-american-capitalism).

But the more important one is the social contract. GitHub long predates the LLM era. Its branding is about being the home of open source projects, and many users want it to stay away from the AI hype. You don't expect LLM providers to stay away from AI hype (duh), so it's less of an issue for them.

stavros 4 hours ago | parent | prev [-]

If they give me the resulting model in the end, they can train on my data all they want. Hell, I'll send them more of it.