| ▲ | adgjlsfhk1 10 hours ago |
| It seems like lots of this is in distribution, and that's somewhat the problem: the Internet contains knowledge of how to make a bomb, and therefore so does the LLM. |
|
| ▲ | xg15 9 hours ago | parent [-] |
| Yeah, it seems it's more "exploring the distribution", since we don't actually know everything that the AIs are effectively modeling. |
| ▲ | lawlessone 8 hours ago | parent [-] |
| Am I understanding correctly that "in distribution" means the text predictor is more likely to predict bad instructions if you already get it to say words related to the bad instructions? |
| ▲ | andy99 7 hours ago | parent [-] |
| It basically means the kinds of training examples it's seen. The models have all been fine-tuned to refuse to answer certain questions, across many different ways of asking them, including obfuscated and adversarial ones, but poetry is evidently so different from what was seen in this type of training that it is not refused. |