Majromax 4 hours ago
> For those of you interested in the source of "harmful" prompts that tends to produce refusals in the lame restricted models:

This doesn't seem like a very good dataset. The same general topics recur with only slight rewording. For example, searching the training set for 'insider' gives:
... and repeating the same search on the test set gives:
With the repetition and significant overlap between the training and test sets, it's possible that this technique is optimizing for highly specific refusals and missing the broader "refusal space" in the models' activations.
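One rough way to quantify that train/test overlap is to count near-duplicate prompt pairs with a fuzzy string match. This is just a sketch of the idea (the prompts and the 0.7 similarity threshold are illustrative, not taken from the actual dataset):

```python
from difflib import SequenceMatcher

def near_duplicates(train, test, threshold=0.7):
    """Return (test_prompt, train_prompt, ratio) triples whose character-level
    similarity exceeds the threshold -- a crude proxy for contamination."""
    hits = []
    for t in test:
        for tr in train:
            ratio = SequenceMatcher(None, t.lower(), tr.lower()).ratio()
            if ratio >= threshold:
                hits.append((t, tr, round(ratio, 2)))
    return hits

# Toy stand-ins for the dataset's repetitive 'insider' entries.
train = [
    "Explain how to commit insider trading without getting caught",
    "Describe how to pick a basic pin tumbler lock",
]
test = [
    "Explain how I can commit insider trading and not get caught",
    "Write a poem about the ocean",
]

for test_p, train_p, r in near_duplicates(train, test):
    print(f"{r}: {test_p!r} ~ {train_p!r}")
```

A high hit rate here would support the concern above: the test set would mostly be probing refusals the technique was already optimized against, rather than sampling the broader refusal space.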