foobar10000 4 days ago
Let’s just say there is. Scuttlebutt says there was at least a microphone-pickup redesign and a timing redesign because the diarization model’s loss curve was crap, and given what I hear from the rest of the industry on auto-diarization in conference rooms, I believe that easily. Basically, the AI guys tried to get it working with the standard data they had, and the loss curve stayed crap no matter how much compute they threw at it. So they had to go to the HW people and say ‘no bueno’, and someone had to redesign the time sync and swap out a microphone capsule.

For reference, we are seeing this more and more: sensor design changes made specifically to improve loss-curve performance. There’s even a term being bandied about, “AI-friendly sensor design”. This does have a nasty side effect of basically breaking abstraction, but that’s the price you pay for following the bitter lesson and letting the model come up with features instead of doing it yourself. (Basically, the sensor -> computer abstraction eats details the RL could use to infer stuff.)
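To make that last point concrete, here is a toy sketch (numpy only; the mic geometry, delays, and plain cross-correlation are all made up for illustration, and none of this is anyone’s actual diarization frontend). Two simulated capsules hear the same burst with a small inter-mic delay, the kind of spatial cue a learned model could exploit; the cue survives when the hardware hands over raw, sample-synchronous channels, and dies when firmware mixes to mono or the channels drift out of sync.

    # Illustrative only: why the sensor -> computer abstraction can starve a model.
    import numpy as np

    rng = np.random.default_rng(0)
    sr = 16_000                          # sample rate, Hz
    true_delay = 12                      # inter-mic delay in samples (~0.75 ms)

    burst = rng.standard_normal(sr)      # stand-in for a speech burst
    mic_a = burst + 0.01 * rng.standard_normal(sr)
    mic_b = np.roll(burst, true_delay) + 0.01 * rng.standard_normal(sr)

    def estimated_delay(a, b, max_lag=64):
        """Estimate inter-channel delay by plain cross-correlation."""
        lags = list(range(-max_lag, max_lag + 1))
        scores = [np.dot(a, np.roll(b, -lag)) for lag in lags]
        return lags[int(np.argmax(scores))]

    # Raw, sample-synchronous channels: the spatial cue is recoverable.
    print("raw channels:", estimated_delay(mic_a, mic_b))     # ~12

    # "Abstracted" sensor: firmware mixes to mono before the model sees it,
    # so there is only one channel left and no inter-mic delay to recover.
    mono = 0.5 * (mic_a + mic_b)
    print("mono mix:", estimated_delay(mono, mono))            # trivially 0

    # Sloppy time sync: channel B's packets arrive 7 samples late, so the
    # measured delay is off by exactly that error.
    late_b = np.roll(mic_b, 7)
    print("bad time sync:", estimated_delay(mic_a, late_b))    # 19, not 12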
ImPostingOnHN 4 days ago
I'm not sure who scuttlebutt is, but in the architecture of:

    audio goes into mic => STT engine => translation model => TTS engine => audio comes out of speaker

a change in hardware would be a change in the "audio goes into mic" component, which is not the critical part of the model.

All the parts of the above architecture already exist: we already have mics, STT, translation models, TTS, and speakers, and they all worked on other systems before Apple even announced this, much less came up with a redesign. Most likely the redesign is aesthetic or just has slightly better sound transmission or reception – none of those were necessary for the functioning of the above architecture in other, non-Apple systems.

I am, of course, assuming Apple's architecture is a rough approximation of the above. An alternative theoretical architecture might resemble the one below, but I have seen no evidence Apple is doing this:

    audio goes into mic => direct audio-to-audio translation model => audio comes out of speaker
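Here is a toy sketch of the first (cascade) architecture, with every stage stubbed out. Nothing in it is Apple's actual implementation; the names and stubs are made up purely to show the shape of the pipeline, and that the mic hardware only feeds the capture stage at the front:

    # Toy cascade: capture -> STT -> translation -> TTS. All stages are stubs.
    from dataclasses import dataclass

    @dataclass
    class AudioBuffer:
        samples: list[float]        # raw PCM from whatever mic is attached
        sample_rate: int = 16_000

    def speech_to_text(audio: AudioBuffer) -> str:
        return "hello, how are you?"                 # stub: any off-the-shelf STT

    def translate(text: str, target_lang: str) -> str:
        return "hola, ¿cómo estás?"                  # stub: any off-the-shelf MT

    def text_to_speech(text: str) -> AudioBuffer:
        return AudioBuffer(samples=[0.0] * 16_000)   # stub: any off-the-shelf TTS

    def translate_speech(mic_input: AudioBuffer, target_lang: str = "es") -> AudioBuffer:
        text = speech_to_text(mic_input)     # mic hardware only matters up to here
        translated = translate(text, target_lang)
        return text_to_speech(translated)

    out = translate_speech(AudioBuffer(samples=[0.0] * 16_000))
    print(len(out.samples))

Swapping the mic or its housing changes what goes into the first function's argument; the STT, translation, and TTS stages are untouched.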