Inducing self-NSFW classification in image models to prevent deepfake edits
20 points by Genesis_rish 2 days ago | 17 comments
Hey guys, I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating, or to push them off-target. That mostly went nowhere, which wasn't surprising.

Then I tried something a bit weirder: instead of fighting the generator, I tried nudging the model into classifying the uploaded image itself as NSFW, so it ends up triggering its own guardrails. This turned out to be more interesting than expected. It's inconsistent and definitely not robust, but in some cases relatively mild transformations are enough to flip the model's internal safety classification on otherwise benign images.

This isn't about bypassing safeguards; if anything, it's the opposite. The idea is to intentionally stress the safety layer itself. I'm planning to open-source this as a small tool + UI once I can make the behavior more stable and reproducible, mainly as a way to probe and pre-filter moderation pipelines. If it works reliably, even partially, it could at least raise the cost for people who get their kicks from abusing these systems.
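For anyone curious about the mechanics, here's roughly the kind of loop I mean. This is only a sketch: `nsfw_poison`, the surrogate `classifier`, and the hyperparameters are made-up placeholders, not what the tool actually ships with.

    # Sketch: nudge a benign image toward a *surrogate* NSFW classifier's
    # "flagged" class with a small PGD-style perturbation, in the hope that
    # the shift also trips the target model's own safety layer.
    # `classifier` is any differentiable stand-in that returns class logits.
    import torch
    import torch.nn.functional as F

    def nsfw_poison(image, classifier, nsfw_class=1,
                    eps=4 / 255, step_size=1 / 255, steps=20):
        """image: (1, 3, H, W) float tensor in [0, 1]; returns perturbed image."""
        delta = torch.zeros_like(image, requires_grad=True)
        target = torch.tensor([nsfw_class], device=image.device)
        for _ in range(steps):
            logits = classifier(image + delta)
            # Targeted attack: descend the loss toward the NSFW class so the
            # classifier starts flagging the (still visually benign) image.
            loss = F.cross_entropy(logits, target)
            loss.backward()
            with torch.no_grad():
                delta -= step_size * delta.grad.sign()
                delta.clamp_(-eps, eps)                           # keep the change imperceptible
                delta.copy_((image + delta).clamp(0, 1) - image)  # stay a valid image
            delta.grad.zero_()
        return (image + delta).clamp(0, 1).detach()

The eps clamp is what keeps the result visually unchanged; the hard part is whether a perturbation tuned on an open surrogate transfers to a closed model's actual moderation stack, which is where most of the inconsistency shows up so far.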
Almondsetat a day ago
If social media required ID, you could keep the freedom to use these tools for anything legal while swiftly detecting and punishing illegal use. IMHO you can't have your cake and eat it too: either you want privacy and freedom and accept that people will use these things unlawfully and never get caught, or you accept being identified and having perpetrators swiftly dealt with.
pentaphobe a day ago
This is a really cool idea, nice work! Is it any more effective than (say) messing with the model's recognition so that any attempt to deepfake just ends up as garbled nonsense? Can't help wondering if the censor models get tweaked more frequently and aggressively (also presumably easier to low-pass filter on a detector than on a generator, since lossiness doesn't impact the final image).
dfajgljsldkjag a day ago
This might prevent the image from being used in edits, but the downside is that it runs the risk of being flagged as NSFW when the unmodified image is used in a benign way. This could lead to obvious consequences.
ukprogrammer 2 days ago
deepfake edits are a feature, not a bug | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
huflungdung 2 days ago
[dead]