| ▲ | nhecker 4 hours ago | |
I'm genuinely ignorant of how those red teaming attempts are incorporated into training, but I'd guess that this kind of dialogue is fed in something like normal training data? Which is interesting to think about: they might not even be red-team dialogue from the model under training, but still useful as an example or counter-example of what abusive attempts look like and how to handle them. | ||