adamcharnock 3 days ago

I would love to see this come to our various mobile devices in a nicely packaged form. I think part of what is holding back assistants, universal translators, etc. is poor audio. Both reducing noise and detecting direction have huge potential to help (I want to live-translate a group conversation around a dining table, for example).

Firstly, it would be great if my phone + headphones could combine their microphones to this end. But what if all phones in the immediate vicinity could cooperate to provide high-quality directional audio? (Assuming privacy issues could be addressed.)

abecedarius 3 days ago | parent | next [-]

For the hard of hearing like me the killer application would be live transcription in a noisy setting like a meetup or party, with source separation and grouping of speech from different speakers. Could be life-changing.

(Android's Live Transcribe is very good now but doesn't even try to separate which words are from different speakers.)

adolph 3 days ago | parent [-]

*Automatic speech recognition (ASR) systems have progressed to the point where humans can interact with computing devices using speech. However, the distance between a device and the speaker will cause a loss in speech quality and therefore impact the effectiveness of ASR performance. As such, there is a greater need to have reliable voice capture for far-field speech recognition. The launch of Amazon Echo devices prompted the use of far-field ASR in the consumer electronics space, as it allows its users to interact with the device from several meters away by using microphone array processing techniques.*

https://assets.amazon.science/da/c2/71f5f9fa49f585a4616e49d5...

MVissers 3 days ago | parent | prev | next [-]

I believe modern MacBook Pros already have multiple microphones that probably do some phased-array magic.

refulgentis 3 days ago | parent [-]

Pretty much every device does; the trick was always whether it actually worked, which Apple is assuredly great at. (Source: worked on Google Assistant.)

spaceywilly 3 days ago | parent | prev | next [-]

This is known as the Cocktail Party Problem. It turns out our brains do an incredible amount of processing to allow us to understand a person talking to us in a noisy room.

https://en.wikipedia.org/wiki/Cocktail_party_effect?wprov=sf...

quantadev 3 days ago | parent | prev | next [-]

In general the positions of the microphones in space must be known precisely for the phase-shifting math to be done well, and the clocks on the phones would need to be in sync at high precision, something like 10x the highest frequency of sound you're picking up; in other words, within tens of thousandths of a second. Also, if the mic array's layout is not a straight line, circle, or other simple geometry, the computer code (i.e. math) needed to milk out an improved signal becomes very difficult.
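
For concreteness, here is a minimal sketch of the delay-and-sum beamforming this describes, in Python with NumPy. The far-field (planar wavefront) assumption, the function name, and all parameters are illustrative rather than any shipping implementation:

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s in air at ~20 C

    def delay_and_sum(signals, mic_positions, look_direction, fs):
        """Align and sum mic signals for a source in look_direction.

        signals:        (n_mics, n_samples) synchronized recordings
        mic_positions:  (n_mics, 3) coordinates in meters
        look_direction: unit vector from the array toward the source
        fs:             sample rate in Hz
        """
        # Far-field assumption: the wavefront is planar, so a mic's
        # relative arrival time is the projection of its position onto
        # the look direction, divided by the speed of sound.
        delays = mic_positions @ look_direction / SPEED_OF_SOUND
        delays -= delays.min()  # shift so all delays are >= 0

        n_mics, n_samples = signals.shape
        out = np.zeros(n_samples)
        for sig, d in zip(signals, delays):
            # Nearest-sample delay; a real implementation would use
            # fractional delays (interpolation or frequency-domain
            # phase shifts) for finer steering.
            shift = int(round(d * fs))
            out[shift:] += sig[:n_samples - shift]
        return out / n_mics

This is where the precision requirements above come from: both the mic positions (which set the delays) and the shared time base (which the per-sample alignment assumes) have to be accurate for the summed signals to add coherently.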

NavinF 3 days ago | parent [-]

> tens of thousandths of a second

10ms? That's a very long time. Phone clocks are much more accurate than that because they're synced to the atomic clocks in cell towers and GPS satellites.

Hell, even NTP can do 1 ms over the internet. AFAIK the only modern devices whose clocks are off by more than 10 ms by default are Windows desktops. I complained about that before because it screwed up my one-way latency measurements: https://github.com/microsoft/WSL/issues/6310

I solved that problem by RTFM and toggling some settings until I got the same accuracy as Linux: https://learn.microsoft.com/en-us/windows-server/networking/...

Anyway, I dunno why the math would be too complicated; GPUs are great at this kind of signal processing.

quantadev 3 days ago | parent [-]

What I meant by that millisecond order of magnitude was that the clocks on the phones would need to be synchronized with each other to high precision, which would require pre-planning and special effort.

In 10 ms sound travels about 3 meters, which is on the order of a room, and that represents the range of time offsets we're talking about. This has nothing to do with the actual frequencies of the sound itself, or with the PCM-style sampling rate you need to record quality audio; that's a separate issue that doesn't involve synchronization between devices.
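
A quick back-of-the-envelope check of those figures (343 m/s is the usual speed of sound in air; the 3 m room scale and 4 kHz band edge are illustrative numbers, not from the thread):

    SPEED_OF_SOUND = 343.0  # m/s in air at ~20 C

    # Range of inter-mic time offsets across a room-sized array:
    room_scale = 3.0  # meters, illustrative
    max_offset_ms = room_scale / SPEED_OF_SOUND * 1e3
    print(f"{max_offset_ms:.1f} ms across {room_scale:.0f} m")  # ~8.7 ms

    # For coherent phase alignment, clock error should be a small
    # fraction of a period at the highest frequency of interest:
    f_max = 4000.0  # Hz, illustrative upper edge for speech
    period_us = 1e6 / f_max
    print(f"period at {f_max:.0f} Hz: {period_us:.0f} us")  # 250 us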

Regarding the math: a circular array is better than a grid (or random placement) because one single formula covers the comparison of any mic against any other mic. With a grid array the number of unique formulas grows as the square of the array's size, and the mics at the 'center' of a grid are basically worthless, adding no value. A rough sketch of that symmetry follows below.
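
Here is that sketch, assuming a uniform circular array in the far field; the function name and the 8-mic example are made up for illustration:

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s

    def circular_array_delays(n_mics, radius, azimuth):
        """Far-field steering delays for a uniform circular array.

        Mic k sits at angle 2*pi*k/n_mics on a circle of the given
        radius; azimuth is the source direction in radians. A single
        closed-form cosine covers every mic, which is the symmetry
        referred to above.
        """
        mic_angles = 2 * np.pi * np.arange(n_mics) / n_mics
        # Projection of each mic position onto the source direction:
        delays = radius * np.cos(mic_angles - azimuth) / SPEED_OF_SOUND
        return delays - delays.min()  # non-negative, in seconds

    # Example: 8 mics on a 5 cm ring, source at 30 degrees
    print(circular_array_delays(8, 0.05, np.radians(30)))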

hatsunearu 3 days ago | parent | prev [-]

It's already kind of implemented.