Audio is the one area small labs are winning (amplifypartners.com)
110 points by rocauc 3 days ago | 19 comments
tl2do 4 hours ago | parent | next [-]

This matches my experience. In Kaggle audio competitions, I've seen many competitors struggle with basics like proper PCM filtering - anti-aliasing before downsampling, handling spectral leakage, etc.
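For concreteness, here's the kind of thing I mean: a minimal sketch of both pitfalls (naive decimation with no anti-aliasing filter, and unwindowed FFT frames), with illustrative sample rates and random noise standing in for real audio.

```python
import numpy as np
from scipy import signal

fs_in, fs_out = 48_000, 16_000           # illustrative: downsample 48 kHz -> 16 kHz
x = np.random.randn(fs_in)               # stand-in for one second of real audio

# Wrong: naive decimation folds everything above the new Nyquist (8 kHz)
# back into the band as aliasing.
x_bad = x[::3]

# Right: low-pass below the new Nyquist before throwing samples away.
lp = signal.firwin(numtaps=101, cutoff=fs_out / 2, fs=fs_in)
x_good = signal.upfirdn(lp, x, up=1, down=3)
# (signal.resample_poly(x, up=1, down=3) does the filter + decimate in one call.)

# Spectral leakage: window each analysis frame instead of taking a raw cut.
frame = x_good[:1024] * signal.windows.hann(1024)
spectrum = np.fft.rfft(frame)
```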

Audio really is a blue ocean compared to text/image ML. The barriers aren't primarily compute or data - they're knowledge. You can't scale your way out of bad preprocessing or codec choices.

When 4 researchers can build Moshi from scratch in 6 months while big labs consider voice "solved," it shows we're still in a phase where domain expertise matters more than scale. There's an enormous opportunity here for teams who understand both ML and signal processing fundamentals.

derf_ 2 hours ago | parent | next [-]

Also, while the author complains that there is not a lot of high quality data around [0], you do not need a lot of data to train small models. Depending on the problem you are trying to solve, you can do a lot with single-digit gigabytes of audio data. See, e.g., https://jmvalin.ca/demo/rnnoise/

[0] Which I do agree with, particularly if you need it to be higher quality or labeled in a particular way: the Fisher database mentioned is narrowband and 8-bit mu-law quantized, and while there are timestamps, they are not accurate enough for millisecond-level active speech determination. It is also less than 6000 conversations totaling less than 1000 hours (x2 speakers, but each is silent over half the time, a fact that can also throw a wrench in some standard algorithms, like volume normalization). It is also English-only.
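To make the volume-normalization point concrete, here's a crude sketch (synthetic numbers, not actual Fisher data): a channel that is silent most of the time drags whole-file RMS down, so naive loudness normalization over-amplifies the active speech.

```python
import numpy as np

fs = 8_000                                    # narrowband, like Fisher
speech = 0.3 * np.random.randn(fs * 4)        # 4 s of "speech" (stand-in)
silence = 0.001 * np.random.randn(fs * 6)     # 6 s of near-silence
x = np.concatenate([speech, silence])

def rms_db(v):
    return 20 * np.log10(np.sqrt(np.mean(v ** 2)) + 1e-12)

print("whole file:", rms_db(x))               # pulled several dB low by the silence

frame_len = 160                               # 20 ms frames at 8 kHz
frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)
levels = np.array([rms_db(f) for f in frames])
active = frames[levels > -40.0]               # crude energy gate
print("active speech only:", rms_db(active.ravel()))
```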

tl2do 2 hours ago | parent [-]

RNNoise is a great example — Jean-Marc Valin proved you can do serious work with kilobytes of model weights and modest training data. The "need petabytes or go home" mindset is definitely wrong for audio.

The data bottleneck you mention is real, though, and it's where policy becomes a technical constraint. Japan's copyright law explicitly allows AI training on copyrighted works without permission (Article 30-4). The US is murkier, but case law seems to be trending toward fair use when the model itself and its outputs don't contain reproductions of the original audio.

That distinction matters — training on copyrighted speech is one thing, outputting that same speech is another. If US jurisprudence solidifies around that separation, it opens up a lot more training data without forcing every lab to move to Tokyo.

The Fisher database limitations you noted are exactly why this matters. When you're competing with labs that can legally scrape high-fidelity labeled data from anime/games/audiobooks, legal uncertainty becomes a real competitive disadvantage. Knowledge barriers are tractable. Legal barriers? Those are harder to engineer around.

nubg 36 minutes ago | parent | prev [-]

AI bot comment

nowittyusername 2 hours ago | parent | prev | next [-]

Good article, and I agree with everything in it. For my own voice agent I decided to make it push-to-talk by default, because the problem of the model accurately guessing the end of an utterance is just too hard. I think it can be solved eventually, but I haven't seen a really good example of it being done with modern tech, including this lab's. Fundamentally it comes down to the fact that different humans speak differently, and a human listener updates their internal model of the other person's speech patterns, adjusting it after a couple of interactions and converging on the right way to converse with that person. Something very similar will need to happen, and at very low latency, for it to work in audio ML. I don't think we have anything like that yet. The best you can currently do is tune the model on a generic speech pattern that fits a large percentage of the population; anyone who falls outside of it will feel the pain of getting interrupted every time.
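For what it's worth, the per-speaker adaptation I have in mind looks something like this toy sketch (my own, nothing this lab or the article describes): an endpointer whose end-of-turn silence timeout drifts toward the pause lengths a particular speaker actually produces, fed by whatever frame-level VAD you already run.

```python
class AdaptiveEndpointer:
    """Toy end-of-utterance detector with a per-speaker adaptive silence timeout."""

    def __init__(self, base_timeout=0.8, margin=1.5):
        self.timeout = base_timeout   # seconds of silence that end a turn
        self.margin = margin          # how far above observed pauses to sit
        self.silence = 0.0            # current run of silence, in seconds
        self.in_speech = False

    def update(self, is_speech: bool, frame_dur: float = 0.02) -> bool:
        """Feed one VAD decision per frame; returns True when the turn ends."""
        if is_speech:
            if self.in_speech and self.silence > 0.05:
                # A mid-utterance pause just ended: nudge the timeout toward this
                # speaker's style (longer pausers get more patient endpointing).
                self.timeout = 0.9 * self.timeout + 0.1 * self.margin * self.silence
            self.in_speech = True
            self.silence = 0.0
            return False
        self.silence += frame_dur
        if self.in_speech and self.silence >= self.timeout:
            self.in_speech = False
            return True
        return False

# Usage: ep = AdaptiveEndpointer(); call ep.update(vad(frame)) for each 20 ms frame,
# where vad() is whatever frame-level voice activity detector you already have.
```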

dkarp 4 hours ago | parent | prev | next [-]

There's too much noise at large organizations

echelon 3 hours ago | parent [-]

They're focused on soaking up big money first.

They'll optimize down the stack once they've sucked all the oxygen out of the room.

Little players won't be able to grow through the ceiling the giants create.

etherus an hour ago | parent [-]

Why would they do that? Once they hit their win condition, there's no reason to innovate, only to reduce the cost of existing solutions. Unless voice becomes a parameter that drives competition and first-choice adoption, I don't expect it will ever become a focus of the frontier orgs. Which is curious to me, since that's almost the opposite of how I read your comment.

giancarlostoro 4 hours ago | parent | prev | next [-]

OpenAI being the death star and audio AI being the rebels is such a weird comparison, like what? Wouldn't the real rebels be the ones running their own models locally?

tl2do 3 hours ago | parent [-]

True, but there's a fun irony: the Rebels' X-Wings are powered by GPUs from a company that's... checks relationships ...also supplying the Empire.

NVIDIA's basically the galaxy's most successful arms dealer, selling to both sides while convincing everyone they're just "enabling innovation." The real rebels would be training audio models on potato-patched RP2040s. Brave souls, if they exist.

garyfirestorm 3 hours ago | parent [-]

Not sure about the irony - you can't really expect rebels to start their own weapons manufacturing right from converting ore into steel... these things are often supplied by a large manufacturer (which is often a monopoly). Why is it any different for a startup to tap into Nvidia's proverbial shovel in order to start digging for gold?

tl2do 2 hours ago | parent [-]

Reply to garyfirestorm on HN:

Fair point — the X-Wing analogy breaks down when you look at actual insurgencies. Rebels absolutely use off-the-shelf weapons from whoever will sell to them.

But here's the thing: we're actually entering an era where "homebrew weapons" is becoming possible for inference. Apple's Neural Engine, Google's TPU, Qualcomm's Hexagon — these are NPUs shipping in billions of devices already. You've got startups like Syntiant making ultra-low-power inference chips for always-on voice, and even microcontroller vendors adding ML accelerators.

The "rebel" angle shifts from "manufacturing your own GPU" to "optimizing for the silicon that's already in your pocket." That's where things get interesting — running decent audio models on a $5 Raspberry Pi Zero or an ESP32 with an accelerator add-on.

Granted, training still needs the datacenter. But inference? We're getting to the point where "rebel infrastructure" is just "commodity hardware + smart optimization." I'm betting on that side.
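To put a rough shape on "commodity hardware + smart optimization": the sketch below is just tflite_runtime (the usual route on a Pi-class board) driving a quantized audio classifier. The model filename is hypothetical; any int8 keyword-spotting model would be invoked the same way.

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter

# Hypothetical int8 keyword-spotting model; swap in whatever you actually trained.
interp = Interpreter(model_path="kws_int8.tflite", num_threads=2)
interp.allocate_tensors()
inp = interp.get_input_details()[0]
out = interp.get_output_details()[0]

# Stand-in input buffer shaped like the model expects; a real app would fill it
# with quantized audio from the mic.
audio = np.zeros(inp["shape"], dtype=inp["dtype"])
interp.set_tensor(inp["index"], audio)
interp.invoke()
scores = interp.get_tensor(out["index"])
print(scores)
```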

adithyassekhar an hour ago | parent [-]

Your reply has a 0% AI score, yet the presence of that "reply to" text is concerning.

RobMurray 25 minutes ago | parent | prev | next [-]

For a laugh, enter nonsense at https://gradium.ai/

You get all kinds of weird noises and random words. Jack is often apologetic about the problem you are having with the Hyperion xt5000 smart hub.

bossyTeacher 4 hours ago | parent | prev | next [-]

Surprised ElevenLabs is not mentioned

krackers 3 hours ago | parent [-]

Also 15.ai [1]

[1] https://en.wikipedia.org/wiki/15.ai

SilverElfin an hour ago | parent | prev | next [-]

Does Wisprflow count as an audio “lab”?

amelius 5 hours ago | parent | prev | next [-]

Probably because the big companies have their focus elsewhere.

lysace an hour ago | parent | prev [-]

Also: porn.

Audio is too niche and porn is too ethically messy and legally risky.

There's also music, which the giants also don't touch. Suno is actually really impressive.