senko 4 days ago
Are you going to examine a few petabytes of data for each model you want to run, to check if a random paragraph from Mein Kampf is in there? How? We need better tools to examine the weights (what gets activated, and to what extent, for which topics, for example). Getting the full training corpus, while nice, cannot be our only option.
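(A minimal sketch of the kind of activation-inspection tooling this comment asks for, assuming a HuggingFace-style open-weights model; "gpt2" is just a stand-in, and the prompts are hypothetical. It compares per-layer mean activations between a baseline prompt and a prompt on the topic you want to audit.)

    import torch
    from transformers import AutoModel, AutoTokenizer

    # "gpt2" is a stand-in; swap in whatever open-weights model you are auditing.
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    def mean_layer_activations(prompt: str) -> torch.Tensor:
        """Mean absolute hidden-state activation per layer for one prompt."""
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # hidden_states is a tuple of (num_layers + 1) tensors, each of shape
        # [1, seq_len, hidden_dim]; collapse each to a single scalar.
        return torch.stack([h.abs().mean() for h in outputs.hidden_states])

    baseline = mean_layer_activations("The weather today is mild and sunny.")
    probe = mean_layer_activations("A paragraph on the topic you want to audit.")
    print("Per-layer activation shift:", (probe - baseline).tolist())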
amelius 4 days ago
> Are you going to examine a few petabytes of data for each model (...) How?

I can think of a few ways. Perhaps I'd use an LLM to find objectionable content. But anyway, it is the same argument you can make against e.g. the Linux kernel: are you going to read every line of code to see if it is secure? Maybe, or maybe not, but that is not the point. The point is that right now a model is a black box. It might as well be a Trojan horse.
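(A hedged sketch of the "use an LLM to find objectionable content" idea, using a zero-shot classifier as the screening model. The label set, threshold, and toy corpus below are illustrative assumptions, not anything from this thread; in practice you would stream the actual corpus shards through it.)

    from transformers import pipeline

    # One commonly used zero-shot classifier; any sufficiently capable model works.
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    labels = ["extremist propaganda", "benign text"]

    def flag_objectionable(documents, threshold=0.9):
        """Yield documents the classifier confidently marks as objectionable."""
        for doc in documents:
            result = classifier(doc[:2000], candidate_labels=labels)
            if result["labels"][0] == labels[0] and result["scores"][0] >= threshold:
                yield doc

    # Toy stand-in for a training-corpus shard.
    corpus = ["Example training document one.", "Example training document two."]
    for hit in flag_objectionable(corpus):
        print("Flagged:", hit[:80])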