jjulius 5 days ago
I am ignorant here, and this is a genuine question: is there any reason to assume that a paper solely about image mimicry can be blanket-applied, as OP is doing, to audio mimicry?
mk_stjames 5 days ago
To add: all the new audio models (at least partially) use diffusion methods that are exactly the same as those used on images; the audio generation can be thought of as image generation on a spectrogram of an audio file.

For early experiments, people literally took Stable Diffusion and fine-tuned it on labelled spectrograms of music snippets, used the fine-tuned model to generate new spectrogram images guided by text, and then turned those images back into audio by re-synthesizing the spectral image into a .wav (a minimal sketch of that round trip is below). Riffusion was one of the first to experiment with this, two years ago now: https://github.com/riffusion/riffusion-hobby

The more advanced music generators out now have, I believe, more of a 'stems' approach and a larger processing pipeline to increase fidelity and add tracking-vocal capability, but the underlying idea is the same.

An adversarial attack that hides information in the spectrogram to fool the model into categorizing the track as something it is not is no different from the image adversarial attacks, for which mitigations have already been found. Various forms of filtering for inaudible spectral content, coupled with methods that destroy and re-synthesize/randomize phase information, would likely break this poisoning attack (second sketch below).
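A minimal sketch of that audio-as-image round trip, assuming librosa and soundfile are installed. The filenames and parameters here are illustrative, not Riffusion's actual settings:

    # Audio -> spectrogram "image" -> audio, the core of the Riffusion approach.
    import librosa
    import numpy as np
    import soundfile as sf

    y, sr = librosa.load("clip.wav", sr=44100)

    # Waveform -> mel power spectrogram: a 2-D array a diffusion model can be
    # trained on once it is scaled to pixel values.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=512, n_mels=256)
    img = librosa.power_to_db(S)  # log scale, image-like dynamic range

    # "Image" -> audio again. mel_to_audio runs Griffin-Lim internally, so
    # the original phase is discarded and estimated from scratch.
    y_back = librosa.feature.inverse.mel_to_audio(librosa.db_to_power(img),
                                                  sr=sr, n_fft=2048,
                                                  hop_length=512)
    sf.write("resynthesized.wav", y_back, sr)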
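And a sketch of the kind of cleaning step the last paragraph describes: strip spectral content above the audible band, then throw away phase and re-synthesize it with Griffin-Lim. The cutoff frequency and parameters are illustrative; this is the general shape of the mitigation, not a specific published pipeline:

    # Remove plausible hiding places for an adversarial perturbation.
    import numpy as np
    import librosa
    import soundfile as sf

    y, sr = librosa.load("poisoned.wav", sr=44100)

    # 1. Low-pass: zero out bins above ~16 kHz, where a perturbation could
    #    sit without being audible to most listeners.
    D = librosa.stft(y, n_fft=2048, hop_length=512)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
    D[freqs > 16000, :] = 0.0

    # 2. Keep only magnitudes; Griffin-Lim re-synthesizes the phase from
    #    scratch, destroying anything encoded in the original phase.
    y_clean = librosa.griffinlim(np.abs(D), n_fft=2048, hop_length=512)
    sf.write("cleaned.wav", y_clean, sr)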
Imnimo 5 days ago
The short answer is that they are applying the same defense to audio as to images, so we should expect the same attacks to work as well.

More specifically, there are a few moving parts here: the GenAI model the defense is trying to defeat, the defense applied to data items, and the data-cleaning process a GenAI company may use to remove the defense. We can look at each and ask whether there is any reason to expect things to turn out differently than they did in the image domain.

The GenAI models follow the same type of training, and while they of course have slightly different architectures to ingest audio instead of images, they still use the same basic operations. The defenses are exactly the same: find small perturbations that are undetectable to humans but produce a large change in model behavior (sketched below). The cleaning processes are not particularly image-specific and translate very naturally to audio; it's stuff like "add some noise and then run denoising" (second sketch below).

Given all of this, it would be very surprising if the dynamics turned out to be fundamentally different just because we moved from images to audio, and the onus should be on the defense developers to justify why we should expect that to be the case.
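For concreteness, the defenses in both domains boil down to something like the projected-gradient sketch below. `model` (a feature extractor the defense wants to fool) and `target` (the embedding of a decoy) are assumptions for illustration; the input `x` could equally be an image tensor or a spectrogram:

    import torch
    import torch.nn.functional as F

    def craft_perturbation(model, x, target, eps=0.01, step=0.002, iters=40):
        # Find a small (L-infinity bounded) delta that pulls the model's
        # embedding of x toward `target` while staying imperceptible.
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(iters):
            loss = F.mse_loss(model(x + delta), target)
            loss.backward()
            with torch.no_grad():
                delta -= step * delta.grad.sign()  # step toward the target
                delta.clamp_(-eps, eps)            # projection: keep it small
                delta.grad.zero_()
        return (x + delta).detach()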
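And the cleaning side, schematically. The denoiser here is a placeholder (a median filter) just to keep the sketch runnable; in a real pipeline it would be a learned model, e.g. a few reverse steps of a diffusion model in the spirit of diffusion purification:

    import numpy as np
    from scipy.ndimage import median_filter

    def denoise(x):
        # Placeholder denoiser; a real pipeline would run a learned model here.
        return median_filter(x, size=3)

    def purify(spec, noise_std=0.1, seed=0):
        # Drown the small adversarial perturbation in fresh Gaussian noise
        # that is large relative to the perturbation, then denoise. The
        # content survives; the carefully crafted perturbation does not.
        rng = np.random.default_rng(seed)
        noised = spec + rng.normal(0.0, noise_std, size=spec.shape)
        return denoise(noised)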