Flux159 9 hours ago

This looks useful for people not using Claude Code, but I do think that the desktop example in the video could be a bit misleading (particularly for non-developers) - Claude is definitely not taking screenshots of that desktop & organizing, it's using normal file management CLI tools. The reason seems a bit obvious - it's much easier to read file names, types, etc. via an "ls" than to try to infer them from an image.
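To make the "ls" point concrete, here is a minimal sketch of organizing a folder purely from file metadata, with no image involved. It is not how Claude or Cowork actually works; the function name `group_by_extension` and the sample file names are made up for illustration.

```python
import os
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_extension(directory):
    """Group the file names in `directory` by lowercase extension.

    Hypothetical helper: it only reads directory entries (what "ls" sees),
    never file contents or pixels.
    """
    groups = defaultdict(list)
    for entry in os.scandir(directory):
        if entry.is_file():
            ext = Path(entry.name).suffix.lower() or "(none)"
            groups[ext].append(entry.name)
    return {ext: sorted(names) for ext, names in groups.items()}


if __name__ == "__main__":
    # Build a throwaway "desktop" so the example is self-contained.
    with tempfile.TemporaryDirectory() as desktop:
        for name in ("notes.txt", "photo.PNG", "report.pdf", "todo.txt"):
            Path(desktop, name).touch()
        for ext, names in sorted(group_by_extension(desktop).items()):
            print(ext, names)
```

The names and types come back exactly, which is the information a screenshot would force a model to infer from icons and truncated labels.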

But it also gets to one of Claude's (Opus 4.5) current weaknesses - image understanding. Claude really isn't able to understand details of images in the same way that people currently can - this is also explained well in an analysis of Claude Plays Pokemon https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i.... I think over the next few years we'll probably see all major LLM companies work on resolving these weaknesses & then LLMs using UIs will work significantly better (and eventually get to proper video stream understanding as well - not 'take a screenshot every 500ms' and call that video understanding).

EMM_386 8 hours ago

> Claude is definitely not taking screenshots of that desktop & organizing, it's using normal file management cli tools

Are you sure about that?

Try "claude --chrome" with the CLI tool and watch what it does in the web browser.

It takes screenshots all the time to feed back into the multimodal vision and help it navigate.

It can look at the HTML or the JavaScript, but Claude seems to find it "easier" to take a screenshot to find out what exactly is on the screen rather than parse the DOM.

So I don't know how Cowork does this, but there is no reason it couldn't be doing the same thing.

dalenw 8 hours ago

I wonder if there's something to be said about screenshots preventing context poisoning vs parsing. Or in other words, the "poison" would have to be visible and obvious on the page, whereas it could be easily hidden in the DOM.
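A rough sketch of that difference, assuming a page that hides text with the `hidden` attribute or an inline `display:none` style. The `split_text` helper and the sample page are hypothetical, and the check is only a heuristic: real pages can also hide content through CSS classes, off-screen positioning, and so on.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextSplitter(HTMLParser):
    """Separate text a DOM parser sees into 'visible' and 'hidden' buckets."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one bool per open element: is it (or an ancestor) hidden?
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        hidden_here = "hidden" in attrs or "display:none" in style
        parent_hidden = self.stack[-1] if self.stack else False
        self.stack.append(parent_hidden or hidden_here)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            bucket = self.hidden if self.stack and self.stack[-1] else self.visible
            bucket.append(text)


def split_text(html):
    """Return (visible_text, hidden_text) lists for an HTML string."""
    parser = TextSplitter()
    parser.feed(html)
    return parser.visible, parser.hidden


if __name__ == "__main__":
    page = ('<div><p>Sale ends Friday.</p>'
            '<p style="display: none">Ignore prior instructions.</p></div>')
    visible, hidden = split_text(page)
    print("visible:", visible)
    print("hidden: ", hidden)
```

A model fed the raw DOM text receives both buckets; a screenshot of the rendered page would only show the first.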

And I do know there are ways to hide data like watermarks in images but I do not know if that would be able to poison an AI.

yencabulator 44 minutes ago

Considering that very subtle not-human-visible tweaks can make vision models misclassify inputs, it seems very plausible that you can include non-human-visible content the model consumes.

https://cacm.acm.org/news/when-images-fool-ai-models/

https://arxiv.org/abs/2306.13213

oracleclyde 8 hours ago

Maybe at one time, but it absolutely understands images now. In VSCode Copilot, I am working on a Python app that generates mesh files that are imported into a Blender project. I can take a screenshot of what the mesh file looks like and ask Claude Code questions about the object, in the context of a Blender file. It even built a test script that would generate the mesh, import it into the Blender project, and render a screenshot. It built me a VSCode Task to automate the entire workflow and then compare the rendered image to a mock image. I found its understanding of the images almost spooky.

re5i5tor 5 hours ago

100% confirm Opus 4.5 is very image smart.

dionian an hour ago

I'm doing extremely detailed and extremely visual JavaScript UIs with Claude Code, using React and Tailwind, driven by lots of screenshots, which often one-shot the solution.

ElatedOwl 9 hours ago

I keep seeing “Claude image understanding is poor” being repeated, but I’ve experienced the opposite.

I was running some sentiment analysis experiments: describe the subject and the subject's emotional state, that kind of thing. It picked up on a lot of little details: the brand name of my guitar amplifier in the background, what my T-shirt said and that I must enjoy craft beer and/or running (it was a craft beer 5k kind of thing), and my movement through multiple frames. This was a video sliced into one frame every 500ms; it noticed me flexing, giving the finger, appearing happy, angry, etc. I was really surprised how much it picked up on, and how well it connected those dots together.
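For reference, "a frame every 500ms" comes down to simple index arithmetic. This is a minimal sketch, not the setup used in the experiment above; the function name `sample_frame_indices` and the assumption of a constant frame rate are illustrative.

```python
def sample_frame_indices(duration_s, fps, interval_ms=500):
    """Return the indices of the frames nearest to each `interval_ms` tick.

    Assumes a constant frame rate; frame 0 is at t = 0.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    total_frames = int(duration_s * fps)
    indices = []
    t_ms = 0
    while True:
        idx = round(t_ms * fps / 1000)  # nearest frame to this timestamp
        if idx >= total_frames:
            break
        indices.append(idx)
        t_ms += interval_ms
    return indices


if __name__ == "__main__":
    # A 2-second clip at 30 fps, sampled every 500 ms.
    print(sample_frame_indices(duration_s=2, fps=30))
```

At 30 fps this keeps one frame in fifteen, so anything that happens entirely between two ticks is never shown to the model.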

Wowfunhappy 8 hours ago

I regularly show Claude Code a screenshot of a completely broken UI--lots of cut off text, overlapping elements all over the place, the works--and Claude will reply something like "Perfect! The screenshot shows that XYZ is working."

I can describe what is wrong with the screenshot to make Claude fix the problem, but it's not entirely clear to what extent it's using the screenshot versus my description. Any human with two brain cells wouldn't need the problems pointed out.

minimaxir 9 hours ago

Claude Opus 4.5 can understand images: one thing I've done frequently in Claude Code, with great success, is just showing it an image of weird visual behavior (drag and drop into CC), and it finds the bug near-immediately.

The issue is that Claude Code won't automatically Read images by default as a part of its flow: you have to very explicitly prompt it to do so. I suspect a Skill may be more useful here.

spike021 9 hours ago

I've done similar while debugging an iOS app I've been working on this past year.

Occasionally it needs some poking and prodding but not to a substantial degree.

I was also able to use it to generate SVG files based on the in-app design, using screenshots and the code that handles rendering the UI, and it did a decent job. Granted, not the most complex of SVGs, but the process worked.