basch 6 hours ago

I’ll disagree with you a little. The reason I don’t use voice is because of context switching.

With a mouse and keyboard I can switch windows.

With my voice, the computer can't yet automatically determine whether I am dictating a transcription or giving editing commands. What I really need is for the interpreter listening to me to intuitively know whether I am in the equivalent of vi command mode or insert mode.

That is the roadblock to not needing a screen at all. Right now I want to visually confirm it understood me, because if it didn't switch from insert to command mode automatically, all my commands end up written into my paragraph. I also don't want to listen to the computer talk back to confirm it heard me. I want to just keep going, to keep narrating my thoughts and trust it's doing the right thing without having to check. Having it slowly chime in to repeat what it heard derails my flow and train of thought.

TL;DR: The future of voice is headless vi.
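The "headless vi" idea above can be sketched as a loop that routes every utterance to either command mode or insert mode, so the speaker never announces a mode switch. This is a toy illustration, not any real speech API: the keyword heuristic and the `route` function are hypothetical stand-ins for whatever intent classifier an actual system would use.

```python
# Toy sketch of a "headless vi" voice loop. Each utterance is classified
# as a command or as dictation; the first-word heuristic below is a
# placeholder for a real intent model.

COMMAND_VERBS = {"delete", "undo", "new", "select"}

def route(utterance: str, buffer: list[str]) -> list[str]:
    """Decide whether the utterance is a command or dictation, then apply it."""
    words = utterance.lower().split()
    if words and words[0] in COMMAND_VERBS:
        # command mode: edit the text instead of transcribing
        if words[:3] == ["delete", "last", "sentence"] and buffer:
            buffer.pop()
        # ...other commands would be handled here
    else:
        # insert mode: transcribe verbatim
        buffer.append(utterance)
    return buffer

buf: list[str] = []
route("The meeting went well.", buf)
route("We shipped on time.", buf)
route("delete last sentence", buf)
print(buf)  # only the first sentence survives
```

The failure mode described above is exactly when the classifier guesses wrong: "delete last sentence" gets appended as prose instead of executed, and without a screen the speaker has no cheap way to notice.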

skeledrew 5 hours ago | parent [-]

The problem I see here is that you're trying to shoehorn a voice interface onto something that's highly optimized for keyboard input. The apps need to be redesigned to accommodate the interface; otherwise it's just never-ending papercuts.

basch 5 hours ago | parent [-]

That's what I'm saying. Voice as the input requires a completely new UI paradigm, and chat / chatbots aren't enough.

mrguyorama 22 minutes ago | parent [-]

Voice input will always be inherently worse than mouse and screen plus keyboard, because voice is linear.

It can only ever be a linear sequence of input.

The two-dimensional field of a screen, combined with a mouse and keyboard, gives you an enormous amount of input and lets you contextualize that input in arbitrary ways that make intuitive sense to people with minimal training. Most people do not need to be taught that "Paste" goes to the active window.

We barely even scratch the surface of what is possible with this set of input and output devices, and yet we can't even get that level of fine-grained, reliable control onto touch screens and gamepads, let alone into a linear stream of pitch.

Voice cannot be a robust interface. It isn't even between humans. There's immense nonverbal communication, and human communication also relies very heavily on preshared context to actually get information across in the first place. Even with all that machinery, human speech is generally estimated to carry only about 39 bits per second of information, regardless of language.