Remix.run Logo
EMM_386 9 hours ago

> Claude is definitely not taking screenshots of that desktop & organizing, it's using normal file management cli tools

Are you sure about that?

Try "claude --chrome" with the CLI tool and watch what it does in the web browser.

It takes screenshots all the time to feed back into the multimodal vision and help it navigate.

It can look at the HTML or the JavaScript but Claude seems to find it "easier" to take a screenshot to find out what exactly is on the screen. Not parse the DOM.

So I don't know how Cowork does this, but there is no reason it couldn't be doing the same thing.

dalenw 8 hours ago | parent [-]

I wonder if there's something to be said about screenshots preventing context poisoning vs parsing. Or in other words, the "poison" would have to be visible and obvious on the page where as it could be easily hidden in the DOM.

And I do know there are ways to hide data like watermarks in images but I do not know if that would be able to poison an AI.

yencabulator an hour ago | parent [-]

Considering that very subtle not-human-visible tweaks can make vision models misclassify inputs, it seems very plausible that you can include non-human-visible content the model consumes.

https://cacm.acm.org/news/when-images-fool-ai-models/

https://arxiv.org/abs/2306.13213