A safer way for LLM agents to touch the outside world
desktop-touch-mcp is an experimental MCP server that gives LLM agents eyes, hands, and a better safety contract with dynamic interfaces.
For now, treat this project as Windows 11 only. It already lets LLMs interact with Windows applications through screenshots, keyboard, mouse, Windows UI Automation, and Chrome DevTools Protocol.
But the deeper goal is not just “look at a screenshot and click some coordinates.” This project is exploring how LLM agents can interact with changing interfaces in a way that is more semantic, more bounded, and less fragile.
Windows 11 only, not multi-OS yet.
This public site currently documents the Windows 11 path. Multi-OS support is not implemented yet, so read the examples, screenshots, and client setup flow as Windows-only for now.
Not a bigger tool catalog. A different contract with the world.
desktop-touch-mcp is a Windows MCP server that exposes screenshots, keyboard and mouse input, Windows UI Automation, Chrome DevTools Protocol, and related desktop-control tools.
The question behind it: how should an LLM agent interact safely with an external world that may already have changed while it was thinking?
Want to try it first?
The quickest way to install and launch the runtime is:
npx -y @harusame64/desktop-touch-mcp
On first run, the launcher downloads the matching Windows runtime from GitHub Releases, verifies it, and caches it locally.
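The download, verify, cache sequence can be pictured with a small sketch. This is not the launcher's actual code: the function name `fetch_verified`, the SHA-256 check, and the cache layout are all assumptions made for illustration, and the download itself is passed in as a callable so the example stays offline.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fetch_verified(cache_dir: Path, name: str, expected_sha256: str,
                   download: Callable[[], bytes]) -> Path:
    """Return a verified file from the cache, downloading it only on a miss.

    Hypothetical helper: the real launcher may verify and cache differently.
    """
    target = cache_dir / name
    if target.exists():
        # Cache hit: re-check the digest so a corrupted cache is not trusted.
        if hashlib.sha256(target.read_bytes()).hexdigest() == expected_sha256:
            return target
        target.unlink()

    data = download()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        # Refuse to cache anything that does not match the expected digest.
        raise ValueError(f"checksum mismatch for {name}: got {digest}")

    cache_dir.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

The point of the pattern is that nothing reaches the cache unverified, and later runs skip the download entirely.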
This is a public workbench, not a finished product brochure.
Already practical
Some pieces are useful today and already shape real desktop control workflows.
Still under test
Some pieces are active design hypotheses being tested in code and dogfooding.
Evaluation in progress
Benchmarking and systematic evidence are still being built out alongside the implementation.
Most GUI agents trust the world for too long.
Many agents still follow a simple loop:
observe → think → act
On a real desktop, that loop often fails, because the world can change between observing and acting:
- another window comes to the front
- a modal dialog appears
- a button moves
- the target element disappears
So the problem is not only whether the model is intelligent enough. The problem is also whether it is acting on assumptions that are already stale.
Meaning-first interaction instead of positional guesswork.
This project publicly describes one of its guiding ideas as Beyond Coordinate Roulette. The phrase points at a familiar failure mode in UI automation: the interface is treated as a flat picture, and action becomes a positional guess.
flowchart TB
subgraph A["Coordinate roulette"]
A1["Looks clickable"]
A2["Guess position"]
A3["Wrong target"]
A1 --> A2 --> A3
end
subgraph B["Beyond Coordinate Roulette"]
B1["See entities"]
B2["Affordances"]
B3["Lease trust"]
B4["Guard action"]
B5["Semantic diff"]
B1 --> B2 --> B3 --> B4 --> B5
end
A3 -. "move beyond this" .- B1
classDef old fill:#fde2e2,stroke:#c0392b,color:#5c1f1f;
classDef new fill:#e8f1ff,stroke:#3b6db3,color:#183257;
class A1,A2,A3 old;
class B1,B2,B3,B4,B5 new;
Four recurring design moves.
Provisional state
Do not keep observed state as timeless truth. Keep it as something that is probably true for now.
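A minimal sketch of what "probably true for now" could look like in code. The `Observation` class and its `max_age` field are hypothetical names chosen for this example, not types from desktop-touch-mcp.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """A value seen at a point in time, not a timeless fact (hypothetical type)."""
    value: str
    observed_at: float  # seconds, from any monotonic clock
    max_age: float      # how long the value is assumed to stay plausible

    def read(self, now: float) -> Optional[str]:
        """Return the value while it is fresh, or None once it has gone stale."""
        if now - self.observed_at > self.max_age:
            return None
        return self.value


title = Observation("Untitled - Notepad", observed_at=100.0, max_age=0.5)
fresh = title.read(now=100.2)  # still within max_age, so the value is returned
stale = title.read(now=101.0)  # too old, so the caller must observe again
```

Returning `None` instead of the old value forces the caller to decide what to do about staleness rather than silently acting on it.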
Leased trust
Do not trust a target forever. Trust it through a short-lived lease.
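One way to picture a lease, as a sketch under assumptions: a lease expires after a fixed time, and it is also revoked whenever the interface is known to have changed. `Lease`, `LeaseBook`, and the generation counter are illustrative names, not the project's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    """Short-lived permission to act on one target (hypothetical type)."""
    target_id: str
    generation: int    # UI generation at the moment the lease was granted
    expires_at: float  # seconds on the caller's clock


class LeaseBook:
    """Grants leases and revokes all of them when the UI changes."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self.generation = 0

    def grant(self, target_id: str, now: float) -> Lease:
        return Lease(target_id, self.generation, now + self.ttl_seconds)

    def ui_changed(self) -> None:
        # A structural change invalidates every outstanding lease at once.
        self.generation += 1

    def is_valid(self, lease: Lease, now: float) -> bool:
        return lease.generation == self.generation and now < lease.expires_at
```

A lease granted at `now=10.0` with a two-second TTL is valid at 11.0, invalid at 12.5, and invalid at any time after `ui_changed()` is called.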
Guarded action
Before acting, check whether the assumptions behind the action are still valid.
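A guarded action can be sketched as a function that runs every guard first and only then fires the action. The names `run_guarded` and `ActionResult` are hypothetical, and the guards here check a plain dictionary standing in for the desktop.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A guard returns None when its assumption holds, or a reason string when it does not.
Guard = Callable[[], Optional[str]]


@dataclass
class ActionResult:
    executed: bool
    blocked_by: List[str]


def run_guarded(action: Callable[[], None], guards: List[Guard]) -> ActionResult:
    """Run the action only if every guard passes; otherwise report why not."""
    failures: List[str] = []
    for guard in guards:
        reason = guard()
        if reason is not None:
            failures.append(reason)
    if failures:
        return ActionResult(executed=False, blocked_by=failures)
    action()
    return ActionResult(executed=True, blocked_by=[])


# Simulated desktop state, standing in for real window queries.
world = {"foreground": "Notepad", "modal_open": False}
guards: List[Guard] = [
    lambda: None if world["foreground"] == "Notepad" else "target window lost focus",
    lambda: "modal dialog is open" if world["modal_open"] else None,
]
```

Collecting every failed guard, rather than stopping at the first, gives the agent a fuller picture of what changed before it decides how to recover.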
Demand-driven perception
Do not pay for expensive perception on every step. Escalate only when the situation demands it.
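Escalation can be sketched as an ordered list of probes, tried from cheapest to most expensive. The tier names used below (cached state, a UI Automation query, a full screenshot) are an assumed ordering for illustration; `perceive` is a hypothetical helper, not a tool exposed by the server.

```python
from typing import Callable, List, Optional, Tuple

# (tier name, probe). A probe returns None when it cannot answer the question.
Probe = Tuple[str, Callable[[], Optional[str]]]


def perceive(probes: List[Probe]) -> Tuple[str, str]:
    """Try probes in order and stop at the first one that yields an answer."""
    for name, probe in probes:
        answer = probe()
        if answer is not None:
            return name, answer
    raise LookupError("no probe could resolve the target")


used: List[str] = []


def expensive_screenshot() -> Optional[str]:
    used.append("screenshot")  # records that the costly tier was actually paid for
    return "pixels"


tier, answer = perceive([
    ("cache", lambda: None),                    # cheapest tier has nothing useful
    ("ui-automation", lambda: "Button 'Save'"),  # mid tier resolves the target
    ("screenshot", expensive_screenshot),        # never reached in this run
])
```

Because the loop returns at the first answer, the screenshot probe is never called here, which is the cost saving the idea is after.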
Reactive Perception Graph
One concrete expression of these ideas is the Reactive Perception Graph (RPG). RPG keeps external state provisional, tracks when that state becomes dirty or stale, and evaluates safety checks before an action is allowed to fire.
Screenshots are not truth.
flowchart LR
subgraph A["Snapshot-and-Act"]
A1["Observe screen"]
A2["Think"]
A3["Act on old assumption"]
A1 --> A2 --> A3
AX["World changed"] -.-> A3
end
subgraph B["Reactive Perception Graph"]
B1["Observe target"]
B2["Store provisional state"]
B3["Dirty / stale signals"]
B4["Validate lease"]
B5["Run guards"]
B6["Execute or block"]
B1 --> B2 --> B3 --> B4 --> B5 --> B6
end
A3 --> C["Unsafe action"]
B6 --> D["Safer action contract"]
classDef bad fill:#fde2e2,stroke:#c0392b,color:#5c1f1f;
classDef good fill:#e5f6ea,stroke:#2e8b57,color:#123d28;
class C bad;
class D good;
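The Reactive Perception Graph pipeline in the diagram above (store provisional state, track dirty and stale signals, validate, run guards, execute or block) can be approximated in a few dozen lines. This is a toy model written for this page, not the project's implementation: `PerceptionGraph`, `Node`, `Decision`, and `attempt` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """One provisional fact about the outside world."""
    value: object
    observed_at: float
    max_age: float
    dirty: bool = False  # set by change signals, such as a focus-change event

    def is_stale(self, now: float) -> bool:
        return now - self.observed_at > self.max_age


@dataclass
class Decision:
    executed: bool
    reasons: List[str] = field(default_factory=list)


class PerceptionGraph:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def observe(self, key: str, value: object, now: float, max_age: float) -> None:
        # A fresh observation replaces the old node and clears its dirty flag.
        self.nodes[key] = Node(value, now, max_age)

    def mark_dirty(self, key: str) -> None:
        if key in self.nodes:
            self.nodes[key].dirty = True

    def attempt(self, depends_on: List[str],
                guards: Dict[str, Callable[[Dict[str, Node]], bool]],
                action: Callable[[], None], now: float) -> Decision:
        reasons: List[str] = []
        # Step 1: every fact the action relies on must still be trusted.
        for key in depends_on:
            node = self.nodes.get(key)
            if node is None:
                reasons.append(f"{key}: never observed")
            elif node.dirty:
                reasons.append(f"{key}: dirty")
            elif node.is_stale(now):
                reasons.append(f"{key}: stale")
        # Step 2: guards run only against state that passed step 1.
        if not reasons:
            for name, guard in guards.items():
                if not guard(self.nodes):
                    reasons.append(f"guard failed: {name}")
        # Step 3: execute or block, always with an explanation.
        if reasons:
            return Decision(False, reasons)
        action()
        return Decision(True)
```

A blocked attempt returns its reasons, so the agent can re-observe the specific facts that went dirty or stale instead of starting over from a full screenshot.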
Where to go next
Questions, bugs, or ideas? Start with GitHub.
I am not publishing a direct contact email on this site. If you found a bug, hit an integration problem, or want to discuss the direction of the project, the best path is GitHub.