cubefox 7 hours ago
They do use diffusion models, but I don't think they would make a detour via images. They can just generate audio directly with audio diffusion rather than image diffusion.
corysama 6 hours ago
Technically, there was one early experiment that tricked Stable Diffusion into generating spectrograms that could then be converted into audio, and it worked surprisingly well. https://web.archive.org/web/20230314190913/https://www.riffu... https://huggingface.co/riffusion/riffusion-model-v1 But I'd expect everything from the past 3 years to diffuse the audio waveform directly.
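For anyone wondering how a spectrogram image gets turned back into sound: a magnitude spectrogram has no phase information, so the phase has to be estimated. Below is a rough sketch of one classic way to do that, Griffin-Lim iteration, using only NumPy. I'm not claiming this is exactly what the linked project does. The function names (`stft`, `istft`, `griffin_lim`), the window/hop sizes, and the test tone standing in for a model-generated spectrogram are all my own assumptions for illustration.

```python
import numpy as np

def stft(signal, n_fft=256, hop=64):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Inverse STFT via windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i] * window
        norm[i * hop : i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)  # guard against the zero-weight edges

def griffin_lim(magnitude, n_iter=50, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from random phase, since the spectrogram "image" carries none.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        audio = istft(magnitude * phase, n_fft, hop)
        rebuilt = stft(audio, n_fft, hop)
        # Keep the target magnitude, adopt the phase of the re-analyzed signal.
        phase = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(magnitude * phase, n_fft, hop)

# Stand-in for a generated spectrogram: the magnitude of a 440 Hz test tone.
sample_rate = 8000  # assumed sample rate for this demo
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
target_mag = np.abs(stft(tone))

audio = griffin_lim(target_mag)
rel_err = np.linalg.norm(np.abs(stft(audio)) - target_mag) / np.linalg.norm(target_mag)
print(f"relative spectrogram error after Griffin-Lim: {rel_err:.4f}")
```

In a real pipeline the `target_mag` array would come from decoding the generated image (pixel intensity mapped to magnitude, usually on a mel or log scale), which is a lossy extra step that direct audio diffusion skips.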