▲ | blixt 3 days ago
I found it interesting that in the segment where two people were communicating "telepathically", they seem to be producing text, which is then put through text-to-speech (using what appeared to be a voice trained on their own -- nice touch). I have to wonder: if they have enough signal to produce what essentially looks like speech-to-text (without the speech), wouldn't it be possible to use the exact same signal to directly produce the synthesized speech? It could also lower latency further by not needing extra surrounding context for the text to be pronounced correctly.
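Roughly, these are the two pipeline shapes I mean. Everything below is a hypothetical stand-in (fake stubs, made-up function names, not anyone's actual implementation), just to make the latency point concrete:

    import numpy as np

    # Hypothetical stubs standing in for real decoder / voice models;
    # none of this reflects the actual system.
    def decode_to_text(signal):       return "hello there"
    def synthesize_speech(text, v):   return np.zeros(16000)      # 1 s of audio
    def decode_to_acoustics(signal):  return np.zeros((80, 100))  # fake mel frames
    def vocode(mel, v):               return np.zeros(16000)

    def pipeline_text_then_tts(signal, voice):
        # Stage 1: neural signal -> text (speech-to-text without the speech).
        # Stage 2: text -> audio in the user's cloned voice. The TTS stage
        # wants surrounding words to get pronunciation right, which is
        # where the extra latency comes from.
        return synthesize_speech(decode_to_text(signal), voice)

    def pipeline_direct(signal, voice):
        # Decode straight to acoustic features and vocode them.
        # No intermediate text, so nothing to buffer for pronunciation.
        return vocode(decode_to_acoustics(signal), voice)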
▲ | com2kid 3 days ago | parent | next [-]
> they seem to be producing text, which is then put through text-to-speech (using what appeared to be a voice trained on their own -- nice touch).

This is an LLM thing. Plenty of open-source (or at least MIT-licensed) LLMs and TTS models exist that do the translation and can be zero-shot trained on a user's speech. Direct audio-to-audio models tend to be less researched and less advanced than the corresponding (but higher-latency) audio-to-text-to-audio pipelines. That said, you can get audio -> text -> audio down to 400 ms or so of latency if you are really damn good at it.
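To make that ~400 ms figure concrete, here is a rough budget for a streaming audio -> text -> audio loop. Stage names and numbers are illustrative guesses, not measurements of any real system:

    # Rough latency budget for a streaming audio -> text -> audio loop.
    # Stage names and numbers are illustrative guesses, not measurements.
    budget_ms = {
        "audio capture / chunking":            80,   # e.g. 80 ms frames
        "streaming ASR (partial hypotheses)":  120,
        "LLM / translation step":              100,
        "incremental zero-shot TTS":            80,
        "playback buffer":                      20,
    }
    for stage, ms in budget_ms.items():
        print(f"{stage:<38} {ms:>4} ms")
    print(f"{'total':<38} {sum(budget_ms.values()):>4} ms")  # ~400 ms end to end

The point is that every stage has to stream and overlap with the next; if any one of them waits for a complete utterance, the total blows well past 400 ms.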
▲ | stevage 3 days ago | parent | prev | next [-]
Interesting. I remember reading a sci-fi book a long time ago that used almost exactly this method, which it called "sub-vocalisation". (I think it was https://en.wikipedia.org/wiki/Oath_of_Fealty_%28novel%29 but I can't find enough details to confirm.)
| ||||||||
▲ | akdor1154 3 days ago | parent | prev [-]
From memory, I think other recent research takes this approach, but isn't yet good enough. Can't remember where I read this, but it was likely HN. I think the posted paper got 95% accuracy when picking from a known set of target sentences/words, but far less (60%?) for freeform input. I'm sure that's not the last word, though!
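The gap between those two numbers makes sense when you look at what each setting asks the decoder to do. A toy sketch (hypothetical decoder output, random numbers, nothing from the actual paper):

    import numpy as np

    # Toy illustration of closed-set vs freeform decoding -- hypothetical
    # decoder output, not the paper's method.
    vocab = ["call", "my", "wife", "turn", "on", "the", "light"]
    candidates = [["call", "my", "wife"], ["turn", "on", "the", "light"]]

    # Pretend the decoder emits a noisy distribution over the vocab
    # at each of three time steps.
    probs = np.random.default_rng(0).dirichlet(np.ones(len(vocab)), size=3)

    def sentence_logprob(sentence):
        # Score a whole candidate against the decoded time steps.
        idx = [vocab.index(w) for w in sentence]
        steps = min(len(idx), probs.shape[0])
        return sum(np.log(probs[t, idx[t]]) for t in range(steps))

    # Closed set: only the known candidates get ranked, so quite a lot of
    # noise can still leave the right sentence on top.
    print("closed-set pick:", " ".join(max(candidates, key=sentence_logprob)))

    # Freeform: every step is an independent pick from the whole vocabulary,
    # so any single noisy step puts a wrong word in the output.
    print("freeform decode:", " ".join(vocab[int(np.argmax(p))] for p in probs))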