Experimental by design

A safer way for LLM agents to touch the outside world

desktop-touch-mcp is an experimental MCP server for giving LLM agents eyes, hands, and a better safety contract with dynamic interfaces.

For now, treat this project as Windows 11 only. It already lets LLMs interact with Windows applications through screenshots, keyboard, mouse, Windows UI Automation, and Chrome DevTools Protocol.

But the deeper goal is not just “look at a screenshot and click some coordinates.” This project is exploring how LLM agents can interact with changing interfaces in a way that is more semantic, more bounded, and less fragile.

Unsafe snapshot-and-act flow versus guarded RPG flow

Current scope

Windows 11 only, not multi-OS yet.

This public site currently documents the Windows 11 path. Multi-OS support is not implemented yet, so the examples, screenshots, and client setup flow are all Windows-specific for now.

What this project is

Not a bigger tool catalog. A different contract with the world.

desktop-touch-mcp is a Windows MCP server that exposes screenshots, keyboard and mouse input, Windows UI Automation, Chrome DevTools Protocol, and related desktop-control tools.

How should an LLM agent interact safely with an external world that may already have changed while it was thinking?

Quick start

Want to try it first?

The quickest way to install and launch the runtime is:

npx -y @harusame64/desktop-touch-mcp

On first run, the launcher downloads the matching Windows runtime from GitHub Releases, verifies it, and caches it locally.
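
Once the launcher works, an MCP client needs to be pointed at it. As a sketch, assuming a client that reads the common `mcpServers` JSON shape (the `desktop-touch` key is an arbitrary label; check your client's documentation for the exact config file and its location):

```json
{
  "mcpServers": {
    "desktop-touch": {
      "command": "npx",
      "args": ["-y", "@harusame64/desktop-touch-mcp"]
    }
  }
}
```

With this shape, the client spawns the launcher itself on startup, so the same first-run download-and-cache behavior applies.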

Experimental note

This is a public workbench, not a finished product brochure.

Already practical

Some pieces are useful today and already shape real desktop control workflows.

Still under test

Some pieces are active design hypotheses being tested in code and dogfooding.

Evaluation in progress

Benchmarking and systematic evidence are still being built out alongside the implementation.

Why this exists

Most GUI agents trust the world for too long.

Many agents still follow a simple loop:

  • observe
  • think
  • act

On a real desktop, that loop fails in predictable ways:

  • another window comes to the front
  • a modal dialog appears
  • a button moves
  • the target element disappears

So the problem is not only whether the model is intelligent enough. The problem is also whether it is acting on assumptions that are already stale.

Beyond Coordinate Roulette

Meaning-first interaction instead of positional guesswork.

This project publicly describes one of its guiding ideas as Beyond Coordinate Roulette. The phrase points at a familiar failure mode in UI automation: the interface is treated as a flat picture, and action becomes a positional guess.

flowchart TB
    subgraph A["Coordinate roulette"]
        A1["Looks clickable"]
        A2["Guess position"]
        A3["Wrong target"]
        A1 --> A2 --> A3
    end

    subgraph B["Beyond Coordinate Roulette"]
        B1["See entities"]
        B2["Affordances"]
        B3["Lease trust"]
        B4["Guard action"]
        B5["Semantic diff"]
        B1 --> B2 --> B3 --> B4 --> B5
    end

    A3 -. "move beyond this" .- B1

    classDef old fill:#fde2e2,stroke:#c0392b,color:#5c1f1f;
    classDef new fill:#e8f1ff,stroke:#3b6db3,color:#183257;
    class A1,A2,A3 old;
    class B1,B2,B3,B4,B5 new;

Core ideas

Four recurring design moves.

Provisional state

Do not keep observed state as timeless truth. Keep it as something that is probably true for now.

Leased trust

Do not trust a target forever. Trust it through a short-lived lease.

Guarded action

Before acting, check whether the assumptions behind the action are still valid.

Demand-driven perception

Do not pay for expensive perception on every step. Escalate only when the situation demands it.
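
The first three moves can be condensed into a loose TypeScript sketch. Every name here is illustrative, not the project's actual API: an observation is stored with an expiry (provisional state, leased trust), and the action only fires while the lease is live and every guard still passes (guarded action).

```typescript
// Provisional state: an observation plus the time window we still trust it for.
interface Lease<T> {
  value: T;
  expiresAt: number; // epoch ms; past this, the observation must be refreshed
}

function lease<T>(value: T, ttlMs: number, now: number = Date.now()): Lease<T> {
  return { value, expiresAt: now + ttlMs };
}

// Guarded action: fire only while the lease is live and every guard still holds.
function guardedAct<T>(
  l: Lease<T>,
  guards: Array<(v: T) => boolean>,
  act: (v: T) => void,
  now: number = Date.now(),
): "executed" | "blocked" {
  if (now >= l.expiresAt) return "blocked"; // stale: re-observe before acting
  if (!guards.every((g) => g(l.value))) return "blocked"; // assumption broken
  act(l.value);
  return "executed";
}
```

A real implementation would re-observe and retry on "blocked" rather than give up; the point of the contract is only that the action itself never runs on expired or guard-failing state.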

One concrete example

Reactive Perception Graph

One concrete expression of these ideas is the Reactive Perception Graph (RPG). RPG keeps external state provisional, tracks when that state becomes dirty or stale, and evaluates safety checks before an action is allowed to fire.

Screenshots are not truth.

flowchart LR
    subgraph A["Snapshot-and-Act"]
        A1["Observe screen"]
        A2["Think"]
        A3["Act on old assumption"]
        A1 --> A2 --> A3
        AX["World changed"] -.-> A3
    end

    subgraph B["Reactive Perception Graph"]
        B1["Observe target"]
        B2["Store provisional state"]
        B3["Dirty / stale signals"]
        B4["Validate lease"]
        B5["Run guards"]
        B6["Execute or block"]
        B1 --> B2 --> B3 --> B4 --> B5 --> B6
    end

    A3 --> C["Unsafe action"]
    B6 --> D["Safer action contract"]

    classDef bad fill:#fde2e2,stroke:#c0392b,color:#5c1f1f;
    classDef good fill:#e5f6ea,stroke:#2e8b57,color:#123d28;
    class C bad;
    class D good;
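
The execute-or-block tail of this flow can be sketched as a single node. Again, these are hypothetical names, not the real RPG implementation: observed state starts dirty, external signals mark it dirty again, and an action on a dirty or guard-failing node is blocked instead of executed.

```typescript
type Outcome = "executed" | "blocked";

class PerceptionNode<T> {
  private value: T | undefined;
  private dirty = true; // nothing observed yet counts as dirty

  observe(v: T): void {
    this.value = v;
    this.dirty = false;
  }

  // Called by cheap external signals: focus change, DOM mutation, window moved.
  invalidate(): void {
    this.dirty = true;
  }

  // Execute-or-block: the action never fires on dirty or guard-failing state.
  act(guard: (v: T) => boolean, run: (v: T) => void): Outcome {
    if (this.dirty || this.value === undefined) return "blocked"; // re-observe first
    if (!guard(this.value)) return "blocked"; // assumption no longer holds
    run(this.value);
    return "executed";
  }
}
```

Wiring invalidate() to event streams is also what makes perception demand-driven: the expensive re-observation happens only after a signal has marked the node dirty, not on every step.
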
Explore

Where to go next

Get in touch

Questions, bugs, or ideas? Start with GitHub.

I am not publishing a direct contact email on this site. If you found a bug, hit an integration problem, or want to discuss the direction of the project, the best path is GitHub.