Remix.run Logo
recursive 6 hours ago

I've noticed this too. I have a few theories about this. Disclosure: I know a little about audio, and very little about audio generative AI.

First, perhaps the models are trained on relatively low-bitrate encodings. Just like image generations sometimes generate JPG artifacts, we could be hearing the known high-frequency loss of low data rate encodings. Another idea is that 'S' and 'T' sounds and similar are relatively broad-spectrum sounds. Not unlike white noise. That kind of sound is known to be difficult to encode for lossy frequency-domain encoding schemes. Perhaps these models work in a similar domain and are subject to similar constraints. Perhaps there's a balance here of low-pass filter vs. "warbly" sounds, and we're hearing a middle ground compromise.

I don't know how it happens, but when I hear the "AI" sound in music, this is usually one of the first tells.