simianwords 7 hours ago
I'm not an expert, but regarding false positives: why not have the agent try to use the backdoor and verify that it really is one? Maybe give it access to tools and so on.
jakozaur 7 hours ago
Many models refuse to do that because of alignment and safety concerns, so a cross-model comparison wouldn't make sense. We do, however, require proof that is hard to game, such as a location in the binary. So the model not only has to say there is a backdoor, but also point to where it is. Your approach makes a lot of sense, though, if you are ready to use your own custom or fine-tuned model.