why_at 6 hours ago

My first impression coming away from this is skepticism.

Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.

Most of their examples seem like they could have been done with a right-click drop-down menu, so they don't really need to "re-invent the mouse pointer".

Is this thing talking to Google's servers all the time for the AI integration? Does that mean it won't work without an internet connection? The privacy concerns are obvious: now Google wants an AI watching literally everything you do on your computer?

Does it cost the user anything for the LLM use? If it's free, will it stay free forever? That's quite a lot to give away if they're expecting people to use it to change a single word, like in one of their examples. I guess they're expecting to make the money back by gathering data about literally everything you do on your computer.

There might be a killer app for AI integration with personal computers that has yet to be invented, but this doesn't look like it.

fny 14 minutes ago | parent | next

It's possible to rely on mouth movements instead of sound. I've been tweaking visual speech recognition (VSR) models for the past few weeks so that I can "talk" to my agents at the office without pissing everyone off. It works okay. Limiting the language to "move this" / "clear that" alongside context cues vastly simplifies the problem and makes it far more feasible on-device.
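
Roughly what I mean by the glue layer, as a minimal Python sketch (the command table, the pointer-target lookup, and the recognized phrase are all stand-ins for illustration, not real model output):

    # Closed vocabulary + pointer context stands in for full
    # natural-language understanding.
    COMMANDS = {"move this": "move", "clear that": "clear", "undo": "undo"}

    def interpret(vsr_phrase, pointer_target):
        # Map a lip-read phrase to an action on whatever the pointer is over.
        action = COMMANDS.get(vsr_phrase.lower().strip())
        if action is None:
            return None  # out-of-vocabulary: ask the user to repeat
        return {"action": action, "target": pointer_target}

    print(interpret("Move this", "layer_3"))
    # {'action': 'move', 'target': 'layer_3'}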

I think it's brilliant UX.

concinds 3 hours ago | parent | prev | next

The second half of your comment is a go-to-market concern, which doesn't feel so relevant for a research prototype. It could be done with a private local model too, just maybe not by Google.

But I don't think the voice problem is surmountable. I closed their image editing demo when I saw it required a mic.

It would be appealing as a Spotlight-like text pop-up interface where you type instructions, which would work in social/office environments, but that might only appeal to power users.
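
Something like this, as a Python sketch (get_selection and send_to_model are hypothetical hooks; a local model could back send_to_model just as well as a remote one):

    def command_palette(get_selection, send_to_model):
        # Spotlight-style flow: typed instruction + current selection -> model.
        instruction = input("> ")      # the pop-up text box
        selection = get_selection()    # whatever is highlighted on screen
        return send_to_model({"instruction": instruction,
                              "context": selection})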

anon84873628 20 minutes ago | parent | next

It seems that if we ultimately want to "move at the speed of thought," it will require speech.

Swizec 4 minutes ago | parent

> It seems that if we ultimately want to "move at the speed of thought," it will require speech.

Except for the large majority of people who read, type, and click way faster than they can talk. Especially for visual things it’s way faster to drag a rectangle than to describe what you want.

A lot of us also aren’t linear verbal thinkers. It would take minutes to hours to verbalize concepts we can grasp visually/schematically in seconds.

Great book on the topic: https://www.goodreads.com/book/show/60149558-visual-thinking

why_at 2 hours ago | parent | prev | next

Yeah, I think there could be something to integrating AI into an operating system so it can handle things going on across different applications, the same way you can already copy and paste between them.
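
Loosely what I'm picturing, as a Python sketch (this "AI pasteboard" API is entirely made up; no OS exposes anything like it today):

    # Hypothetical OS-level "AI pasteboard": apps publish structured context,
    # and the assistant reads across all of them, like a shared clipboard.
    context_board = {}

    def publish(app, payload):
        # An application registers what the user is currently working on.
        context_board[app] = payload

    def assistant_view():
        # The assistant sees every app's context at once, locally.
        return dict(context_board)

    publish("mail", {"draft": "Re: invoice"})
    publish("files", {"selected": "invoice.pdf"})
    print(assistant_view())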

But if it's going to require phoning home to Google/OpenAI/whoever, then forget it. I don't want my OS to keep a constant connection to one of these companies.

aquariusDue 3 hours ago | parent | prev

This will sound like another brick in the paved road to dystopia, but I'm kinda bullish on equipment that can recognize subvocalization. Or at least let me have a small drawing tablet with a stylus (think Etch A Sketch or Wacom Intuos), because at this point I'd rather practice writing and do away with typing altogether (even though I enjoy typing for typing's sake via MonkeyType).

nolist_policy 5 hours ago | parent | prev | next

The "Edit an Image" Demo at the bottom is pretty fun. Maybe this is just Google flexing their LLM inference capacity.

maccard 3 hours ago | parent

That demo was an absolute disaster for me on Firefox on Mac. It just fundamentally didn't work: the voice was way behind my pointer, there were multiple agents speaking over each other saying conflicting things, and it couldn't even move the crab to the bottom right of the image. Embarrassingly bad, I would say!

AirMax98 5 hours ago | parent | prev

Right — it does seem cool but the voice is patching over a major gap. If I'm talking already, why wouldn't I just describe what I'm looking at and have the AI grab it for me?

xg15 4 hours ago | parent

I think they answer that question pretty convincingly: if what you're looking at is already on the screen, it's much easier to point to it and say "that" than to describe it.

(And if it's an abstract entity like a file, it might not even be possible to describe it, short of rattling off the entire file path.)
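
As a toy Python sketch of why pointing wins (the hit-test data and resolver are invented for illustration):

    # "That" resolves to whatever sits under the pointer, so the user
    # never has to name it out loud.
    SCREEN_OBJECTS = [  # toy hit-test data: (x0, y0, x1, y1, identifier)
        (100, 100, 300, 200, "/home/me/reports/final-v2-really-final.pdf"),
        (400, 150, 500, 250, "crab.png"),
    ]

    def resolve_that(px, py):
        # Return the object under the pointer coordinates, if any.
        for x0, y0, x1, y1, ident in SCREEN_OBJECTS:
            if x0 <= px <= x1 and y0 <= py <= y1:
                return ident
        return None

    print(resolve_that(150, 120))  # the unwieldy file path, never spoken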