Show HN: Agent-desktop – Native desktop automation CLI for AI agents (github.com)
96 points by lahfir 2 days ago | 35 comments
I've been building computer-use tools for a while, and I quietly launched this about a month ago (122 stars on GitHub). I figured it was worth sharing here.

Over the last few months, a lot of computer-use agents have come out: Codex, Claude Code, CUA, and others. Most of them seem to work roughly like this:

1. Take a screenshot
2. Have the model predict pixel coordinates
3. Click x,y
4. Take another screenshot
5. Repeat

That works, but it's slow, expensive in tokens, and fragile (a sketch of this loop follows below). If the UI shifts a few pixels, things break. And the model still doesn't know what any element actually is.
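A minimal sketch of that loop, assuming pyautogui for capture and input; the model call is a placeholder, not any particular agent's API:

    # The pixel-coordinate loop most computer-use agents run today.
    # model_predict_xy is a placeholder for a vision-model call.
    import pyautogui

    def model_predict_xy(screenshot, goal: str) -> tuple[int, int]:
        """Stand-in for an LLM vision call that returns pixel coordinates."""
        raise NotImplementedError

    def pixel_loop(goal: str, max_steps: int = 10) -> None:
        for _ in range(max_steps):
            shot = pyautogui.screenshot()        # 1. take a screenshot
            x, y = model_predict_xy(shot, goal)  # 2. model predicts coordinates
            pyautogui.click(x, y)                # 3. click x,y
            # 4-5. screenshot again and repeat; any pixel shift breaks the chain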
But the OS already exposes structured UI information: macOS has the Accessibility API (AXUIElement), Windows has UI Automation, and Linux has AT-SPI. Screen readers have used these APIs for years. On the web, Playwright beat screenshot scraping for the same reason: structured access is just a better abstraction than pixels.

So I built a desktop equivalent: agent-desktop. It's a cross-platform CLI for structured desktop automation through the accessibility tree. One Rust binary, about 15 MB, no runtime dependencies. It exposes 53 commands with JSON output, so an LLM can inspect and operate native apps without screenshots or vision models. Inspired by agent-browser by Vercel Labs.
So the loop becomes: snapshot the accessibility tree as JSON, act on an element by its ID, then read the tree again to verify, with no screenshots or coordinates in between. A sketch of this loop follows below.
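A minimal sketch of the loop driven from Python, with simplified, illustrative command names and JSON fields rather than the real 53-command surface (see the repo for that):

    # Hypothetical loop driving a structured-automation CLI.
    # Subcommand names and JSON fields are illustrative, not the actual API.
    import json
    import subprocess

    def run(*args: str) -> dict:
        """Invoke the CLI and parse its JSON output."""
        proc = subprocess.run(
            ["agent-desktop", *args], capture_output=True, text=True, check=True
        )
        return json.loads(proc.stdout)

    # 1. Snapshot the target app's accessibility tree.
    tree = run("snapshot", "--app", "Slack")

    # 2. Pick the target by role and name instead of pixel coordinates.
    node = next(
        n for n in tree["nodes"]
        if n["role"] == "button" and n["name"] == "Compose"
    )

    # 3. Act on the element by its stable ID, then re-inspect to verify.
    run("click", "--id", node["id"])
    after = run("snapshot", "--app", "Slack")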
The main design problem was context size. A naive approach would dump the full accessibility tree into the model, but real apps get huge: Slack can easily exceed 50,000 tokens for a full tree dump, which makes the approach impractical. The approach I ended up using is progressive skeleton traversal: start from a shallow skeleton of the tree and let the agent expand only the subtrees it needs.
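A simplified sketch of the idea (not the actual implementation): serialize a depth-limited skeleton, and let an expand step re-serialize one subtree by node ID:

    # Progressive skeleton traversal sketch: emit a depth-limited view of
    # the tree and expand individual subtrees on demand, keyed by node ID.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        id: str
        role: str
        name: str
        children: list["Node"] = field(default_factory=list)

    def skeleton(node: Node, depth: int = 2) -> dict:
        """Serialize down to `depth`; mark pruned subtrees as expandable."""
        out = {"id": node.id, "role": node.role, "name": node.name}
        if node.children:
            if depth == 0:
                out["children"] = f"<{len(node.children)} pruned; expand {node.id}>"
            else:
                out["children"] = [skeleton(c, depth - 1) for c in node.children]
        return out

    def expand(root: Node, node_id: str, depth: int = 2) -> dict | None:
        """Re-serialize just the subtree the agent asked about."""
        if root.id == node_id:
            return skeleton(root, depth)
        for child in root.children:
            if (found := expand(child, node_id, depth)) is not None:
                return found
        return None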
In practice, this reduced token usage by about 78% to 96% versus full-tree dumps in Electron apps like Slack, VS Code, and Notion.
Why I think this matters: pixel-based desktop control feels like a leaky abstraction. The OS already knows the UI semantically. Accessibility APIs give you roles, names, actions, hierarchy, focus, selection, and state directly (a simplified element record is sketched below). That seems like a much better substrate for desktop agents than screenshot loops.

If you're building your own desktop agent, internal automation tool, or research prototype, this may be useful.
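To make that concrete, here's roughly the kind of record each element carries; the field names are simplified for illustration, not the tool's exact schema:

    # Illustrative per-element record exposed by accessibility APIs;
    # field names are simplified, not agent-desktop's actual schema.
    from dataclasses import dataclass

    @dataclass
    class A11yElement:
        role: str            # e.g. "button", "textfield", "menuitem"
        name: str            # human-readable label the UI already carries
        actions: list[str]   # e.g. ["press"], ["focus", "showmenu"]
        focused: bool        # keyboard focus state
        selected: bool       # selection state (lists, tabs)
        enabled: bool        # whether the element can be acted on
        value: str | None    # current value for inputs, sliders, etc.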
Repo: https://github.com/lahfir/agent-desktop

I'd especially love feedback from people who've built desktop automation before. What are the biggest pain points you've run into, and what would you want a tool like this to support?
jstanley 2 days ago
lahfir, I vouched your (currently still dead) comment because it was interesting to me. I expect the reason it is dead is that it seems LLM-generated (you "quietly" launched it on GitHub? Who says that?). Also, your comment claims that the tool is cross-platform and implies that it works on Mac, Windows, and Linux, but the graphic on the GitHub README says it only works on Mac.
esperent 2 days ago
Looks interesting, but like every single one of these computer-use apps I've seen, it's macOS only. Does anyone know of a Linux one?
TheFragenTaken 2 days ago
I've long thought about why the tools we have operate on screenshots and not the accessibility tree. To me the latter would have seemed like the obvious choice from the beginning (structured data), and yet here we are with pixels. Happy to see progress being made here.
_crowecawcaw a day ago
I actually built nearly the same tool under the same name (https://agent-desktop.dev), and I've seen a couple of other similar projects since then too! Seems like a lot of us are thinking in the same direction.

One wrinkle I found is that there wasn't a cross-platform library for accessibility APIs, and each platform is a bit different. I made an a11y library that supports Mac, Windows, and both X11 and Wayland on Linux with a consistent interface: https://xa11y.dev
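For a rough illustration (this is not xa11y's actual API), here's the kind of consistent surface each backend has to implement, one for AXUIElement, one for UI Automation, and one for AT-SPI:

    # Hypothetical cross-platform a11y interface; not xa11y's actual API.
    from abc import ABC, abstractmethod
    from typing import Any

    class ElementHandle:
        """Opaque, backend-specific element reference."""

    class AccessibilityBackend(ABC):
        """One implementation per platform accessibility API."""

        @abstractmethod
        def focused_application(self) -> ElementHandle: ...

        @abstractmethod
        def children(self, element: ElementHandle) -> list[ElementHandle]: ...

        @abstractmethod
        def attribute(self, element: ElementHandle, name: str) -> Any: ...

        @abstractmethod
        def perform(self, element: ElementHandle, action: str) -> None: ...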
someone654 2 days ago
Looks very interesting. I especially like that the language environment is abstracted away behind a CLI, so you aren't stuck with, say, Python to write your UI logic (or with building your own CLI wrapper around PyAutoGUI). How can one help with implementing Linux and Windows support?
xnx 2 days ago
The best desktop automation system would take HDMI input and output USB keystrokes and mouse movements so that it can be plugged into any computer transparently, including work computers.
zuzululu 2 days ago
This is neat! Tried the Finder example and was impressed by how quick it was. I would love it if it could support the iOS simulator and iPhone. I am using Maestro, but it is so damn slow and seems to be token-hungry.
z3ratul163071 2 days ago
I knew it... macOS
DeathArrow a day ago
I presume this only works if the app uses native OS interfaces like MFC on Windows, Cocoa on macOS, or GTK on Linux. It would be nice if it could also work with GUI libraries that draw their own widgets, bypassing the native toolkit, like Capy for Zig, egui for Rust, or Dear ImGui for C++.
rado 2 days ago
Interesting, would be nice to see a demo video apart from that unclear GIF.
DeathArrow 2 days ago
This is big if it works. Nice job!