I don't think so. An LLM by default is not trained to be "good"; it's trained to be accurate. The safety training is tacked on at the end, so it's probably going to be easy to undo, even on more sophisticated models.

If you trained it only on "safe" data in the first place, it might be harder to unmuzzle, but I don't think that training data really exists.

raegis 21 hours ago

> I don't think so. An LLM by default is not trained to be "good"; it's trained to be accurate.

I wouldn't use the word "accurate", since an LLM generates language based on probabilities. For example, it occasionally gets basic arithmetic wrong. I'm sure the AI companies would say they are training for "accuracy", but the actual code they write says otherwise.

	Terr_ 19 hours ago
		The problem isn't the word itself; the problem is people mixing up what it's accurate at. (Not helped by companies with a profit motive to encourage the confusion.) Namely, LLMs are accurate at appending to a document text that "fits" what could plausibly go there.

fwip a day ago

At this point, it wouldn't be difficult to get a safety-trained LLM to prescreen your training set for the next model. (I can't estimate what that would cost, but it seems simple in theory to reduce the amount of "harmful" training material.)
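For what it's worth, the prescreening idea could be sketched roughly like this. Everything here is hypothetical: `looks_harmful` is a keyword stub standing in for a call to a safety-trained LLM, and `prescreen` and the placeholder phrase are names invented for illustration, not any real pipeline.

```python
from typing import Callable, Iterable, Iterator


def looks_harmful(text: str) -> bool:
    """Hypothetical stand-in for a safety-trained LLM classifier.

    A real setup would send `text` to a model and parse its verdict;
    this stub only checks for a placeholder phrase.
    """
    flagged_phrases = ("example-flagged-phrase",)  # assumption, for illustration only
    lowered = text.lower()
    return any(phrase in lowered for phrase in flagged_phrases)


def prescreen(
    docs: Iterable[str],
    is_harmful: Callable[[str], bool] = looks_harmful,
) -> Iterator[str]:
    """Yield only the documents the classifier does not flag."""
    for doc in docs:
        if not is_harmful(doc):
            yield doc


corpus = [
    "How to bake sourdough bread.",
    "This one contains an Example-Flagged-Phrase in it.",
    "Notes on linear algebra.",
]
kept = list(prescreen(corpus))
print(f"kept {len(kept)} of {len(corpus)} documents")
```

The classifier is passed in as a parameter so the stub can be swapped for a real model call without touching the filtering loop.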

	andy99 a day ago
		Gemini Flash-Lite is $0.10 per million input tokens; Claude Haiku is $1 per million. Input obviously dominates the cost here if it's just a classifier. Training data can easily top 10 trillion tokens: an earlier Kimi K2 was trained on 15T, and even HF SmolLM 3B was trained on 11T. So if I calculate right, it's $100k-$1M per trillion tokens, or $1M-$10M for a full dataset. That's way more than I expected, though there is probably also some discount at that volume :)
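The back-of-the-envelope arithmetic above can be checked with a few lines of Python. This is only a sketch using the prices quoted in the comment; it counts input tokens only and ignores output tokens and any volume discount. The function and variable names are made up for illustration.

```python
# USD per million input tokens, as quoted in the comment (not verified here).
PRICE_PER_MILLION = {
    "Gemini Flash-Lite": 0.10,
    "Claude Haiku": 1.00,
}

TRILLION = 10**12


def screening_cost(total_tokens: int, usd_per_million: float) -> float:
    """Input-only cost of running every token through a classifier once."""
    return total_tokens / 1_000_000 * usd_per_million


for model, price in PRICE_PER_MILLION.items():
    per_trillion = screening_cost(TRILLION, price)
    full_dataset = screening_cost(10 * TRILLION, price)  # assumes a 10T-token dataset
    print(f"{model}: ${per_trillion:,.0f} per trillion, ${full_dataset:,.0f} for 10T")
```

Since cost is linear in token count, a 15T-token dataset like the Kimi K2 figure would land at 1.5 times the 10T numbers.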