foobar10000 4 days ago
Let’s just say there is. Scuttlebutt says there was at least a microphone-pickup redesign and a timing redesign because the diarization model’s loss curve was crap, and given what I hear from the rest of the industry on auto-diarization in conference rooms, I believe that easily. Basically, the AI guys tried to get it working with the standard data they had, and the loss curve stayed crap no matter how much compute they threw at it. So they had to go to the HW people and say ‘no bueno’, and someone had to redesign the time sync and swap out a microphone capsule.

For reference, we are seeing this more and more: sensor design changes made specifically to improve loss-curve performance. There’s even a term being bandied about, “AI-friendly sensor design”. This does have a nasty side effect of basically breaking abstraction, but that’s the price you pay for following the bitter lesson and letting the model come up with features instead of doing it yourself. (Basically, the sensor -> computer abstraction eats details the RL could use to infer stuff.)
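To make that last point concrete, here is a toy sketch (numpy only; the mic geometry, delays, and plain cross-correlation are all made up for illustration, and none of this is anyone’s actual diarization frontend). Two simulated capsules hear the same burst with a small inter-mic delay, the kind of spatial cue a learned model could exploit; the cue survives when the hardware hands over raw, sample-synchronous channels, and dies when firmware mixes to mono or the channels drift out of sync.

    # Illustrative only: why the sensor -> computer abstraction can starve a model.
    import numpy as np

    rng = np.random.default_rng(0)
    sr = 16_000                          # sample rate, Hz
    true_delay = 12                      # inter-mic delay in samples (~0.75 ms)

    burst = rng.standard_normal(sr)      # stand-in for a speech burst
    mic_a = burst + 0.01 * rng.standard_normal(sr)
    mic_b = np.roll(burst, true_delay) + 0.01 * rng.standard_normal(sr)

    def estimated_delay(a, b, max_lag=64):
        """Estimate inter-channel delay by plain cross-correlation."""
        lags = list(range(-max_lag, max_lag + 1))
        scores = [np.dot(a, np.roll(b, -lag)) for lag in lags]
        return lags[int(np.argmax(scores))]

    # Raw, sample-synchronous channels: the spatial cue is recoverable.
    print("raw channels:", estimated_delay(mic_a, mic_b))     # ~12

    # "Abstracted" sensor: firmware mixes to mono before the model sees it,
    # so there is only one channel left and no inter-mic delay to recover.
    mono = 0.5 * (mic_a + mic_b)
    print("mono mix:", estimated_delay(mono, mono))            # trivially 0

    # Sloppy time sync: channel B's packets arrive 7 samples late, so the
    # measured delay is off by exactly that error.
    late_b = np.roll(mic_b, 7)
    print("bad time sync:", estimated_delay(mic_a, late_b))    # 19, not 12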
ImPostingOnHN 4 days ago
I'm not sure who scuttlebutt is, but in the architecture of:

    audio goes into mic => STT engine => translation model => TTS engine => audio comes out of speaker

a change in hardware would be a change in the "audio goes into mic" component, which is not the critical part of the model.

All the parts of the above architecture already exist: we already have mics, STT, translation models, TTS, and speakers, and they all worked on other systems before Apple even announced this, much less came up with a redesign. Most likely the redesign is aesthetic or just has slightly better sound transmission or reception – none of those were necessary for the functioning of the above architecture in other, non-Apple systems.

I am, of course, assuming Apple's architecture is a rough approximation of the above. An alternative theoretical architecture might resemble the one below, but I have seen no evidence Apple is doing this:

    audio goes into mic => direct audio-to-audio translation model => audio comes out of speaker
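Here is a toy sketch of the first (cascade) architecture, with every stage stubbed out. Nothing in it is Apple's actual implementation; the names and stubs are made up purely to show the shape of the pipeline, and that the mic hardware only feeds the capture stage at the front:

    # Toy cascade: capture -> STT -> translation -> TTS. All stages are stubs.
    from dataclasses import dataclass

    @dataclass
    class AudioBuffer:
        samples: list[float]        # raw PCM from whatever mic is attached
        sample_rate: int = 16_000

    def speech_to_text(audio: AudioBuffer) -> str:
        return "hello, how are you?"                 # stub: any off-the-shelf STT

    def translate(text: str, target_lang: str) -> str:
        return "hola, ¿cómo estás?"                  # stub: any off-the-shelf MT

    def text_to_speech(text: str) -> AudioBuffer:
        return AudioBuffer(samples=[0.0] * 16_000)   # stub: any off-the-shelf TTS

    def translate_speech(mic_input: AudioBuffer, target_lang: str = "es") -> AudioBuffer:
        text = speech_to_text(mic_input)     # mic hardware only matters up to here
        translated = translate(text, target_lang)
        return text_to_speech(translated)

    out = translate_speech(AudioBuffer(samples=[0.0] * 16_000))
    print(len(out.samples))

Swapping the mic or its housing changes what goes into the first function's argument; the STT, translation, and TTS stages are untouched.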