tl2do · 5 hours ago
This matches my experience. In Kaggle audio competitions, I've seen many competitors struggle with basics like proper PCM filtering: anti-aliasing before downsampling, handling spectral leakage, and so on.

Audio really is a blue ocean compared to text/image ML. The barriers aren't primarily compute or data; they're knowledge. You can't scale your way out of bad preprocessing or codec choices.

When 4 researchers can build Moshi from scratch in 6 months while big labs consider voice "solved," it shows we're still in a phase where domain expertise matters more than scale. There's an enormous opportunity here for teams who understand both ML and signal processing fundamentals.
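To make those two preprocessing points concrete, here's a minimal sketch in Python with numpy/scipy. The sample rates, FFT size, and hop length are illustrative assumptions, not values from the thread: low-pass before decimating, and window each frame before the FFT.

    import numpy as np
    from scipy import signal

    def downsample(x, sr_in=48_000, sr_out=16_000):
        """Anti-alias before decimating: naively keeping every Nth
        sample folds everything above the new Nyquist back into band."""
        factor = sr_in // sr_out
        # decimate() applies a low-pass filter internally;
        # zero_phase=True avoids smearing transients.
        return signal.decimate(x, factor, ftype="iir", zero_phase=True)

    def framed_spectrum(x, n_fft=1024, hop=256):
        """Apply a Hann window per frame before the FFT; the implicit
        rectangular window of a raw slice leaks energy across bins."""
        win = signal.get_window("hann", n_fft)
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        return np.fft.rfft(frames * win, axis=-1)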
derf_ · 4 hours ago
Also, while the author complains that there is not a lot of high-quality data around [0], you do not need a lot of data to train small models. Depending on the problem you are trying to solve, you can do a lot with single-digit gigabytes of audio. See, e.g., https://jmvalin.ca/demo/rnnoise/

[0] Which I do agree with, particularly if you need the data to be higher quality or labeled in a particular way. The Fisher database mentioned is narrowband and 8-bit mu-law quantized, and while there are timestamps, they are not accurate enough for millisecond-level active-speech determination. It is also fewer than 6,000 conversations totaling under 1,000 hours (x2 speakers, but each is silent over half the time, a fact that can throw a wrench in some standard algorithms, like volume normalization), and it is English-only.
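Two of those pitfalls are easy to show in a few lines. A minimal sketch in Python/numpy (the energy-gate threshold and target level are arbitrary assumptions, not anything from Fisher's documentation) of expanding mu-law samples back to linear PCM, and of why whole-file RMS normalization goes wrong when each speaker is silent more than half the time:

    import numpy as np

    MU = 255.0  # mu-law companding constant for 8-bit telephony audio

    def mulaw_expand(y):
        """Invert mu-law companding: y in [-1, 1] -> linear in [-1, 1]."""
        return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

    def active_rms(x, sr=8_000, frame_ms=20, thresh_db=-40.0):
        """RMS over frames above a crude energy gate (a stand-in for a
        real VAD; the threshold is an arbitrary assumption). Whole-file
        RMS would be dragged down by the >50% silence per speaker."""
        n = int(sr * frame_ms / 1000)
        frames = x[: len(x) // n * n].reshape(-1, n)
        rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
        gate = 20 * np.log10(rms) > thresh_db
        return rms[gate].mean() if gate.any() else rms.mean()

    def normalize(x, target_db=-23.0, sr=8_000):
        """Scale so the active-speech level, not the whole-file level,
        hits the target."""
        gain = 10 ** (target_db / 20) / active_rms(x, sr)
        return np.clip(x * gain, -1.0, 1.0)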
duped · an hour ago
IMO, audio DSP experts tend to be morally opposed to AI. Good luck hiring the good ones. It's like paying doctors to design guns.
nubg · 2 hours ago
AI bot comment | ||||||||