Experimental by design

A safer way for LLM agents to touch the outside world

desktop-touch-mcp is an experimental MCP server for giving LLM agents eyes, hands, and a better safety contract with dynamic interfaces.

For now, treat this project as Windows 11 only. It already lets LLMs interact with Windows applications through screenshots, keyboard, mouse, Windows UI Automation, and Chrome DevTools Protocol.

But the deeper goal is not just “look at a screenshot and click some coordinates.” This project is exploring how LLM agents can interact with changing interfaces in a way that is more semantic, more bounded, and less fragile.

Unsafe snapshot-and-act flow versus guarded RPG flow

Current scope

Windows 11 only, not multi-OS yet.

This public site currently documents the Windows 11 path. Multi-OS support is not implemented yet, so the examples, screenshots, and client setup flow are all Windows-specific for now.

What this project is

Not a bigger tool catalog. A different contract with the world.

desktop-touch-mcp is a Windows MCP server that exposes screenshots, keyboard and mouse input, Windows UI Automation, Chrome DevTools Protocol, and related desktop-control tools.

How should an LLM agent interact safely with an external world that may already have changed while it was thinking?

Quick start

Want to try it first?

The quickest way to install and launch the runtime is:

npx -y @harusame64/desktop-touch-mcp

On first run, the launcher downloads the matching Windows runtime from GitHub Releases, verifies it, and caches it locally.
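
Once the launcher works, an MCP client needs to be pointed at it. As a sketch, assuming a client that reads the common `mcpServers` JSON shape (the `desktop-touch` key is an arbitrary label; check your client's documentation for the exact config file and its location):

```json
{
  "mcpServers": {
    "desktop-touch": {
      "command": "npx",
      "args": ["-y", "@harusame64/desktop-touch-mcp"]
    }
  }
}
```

With this shape, the client spawns the launcher itself on startup, so the same first-run download-and-cache behavior applies.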

Experimental note

This is a public workbench, not a finished product brochure.

Already practical

Some pieces are useful today and already shape real desktop control workflows.

Still under test

Some pieces are active design hypotheses being tested in code and dogfooding.

Evaluation in progress

Benchmarking and systematic evidence are still being built out alongside the implementation.

Why this exists

Most GUI agents trust the world for too long.

Many agents still follow a simple loop:

  • observe
  • think
  • act

On a real desktop, that loop fails in predictable ways:

  • another window comes to the front
  • a modal dialog appears
  • a button moves
  • the target element disappears

So the problem is not only whether the model is intelligent enough. The problem is also whether it is acting on assumptions that are already stale.

Beyond Coordinate Roulette

Meaning-first interaction instead of positional guesswork.

This project publicly describes one of its guiding ideas as Beyond Coordinate Roulette. The phrase points at a familiar failure mode in UI automation: the interface is treated as a flat picture, and action becomes a positional guess.

flowchart TB
    subgraph A["Coordinate roulette"]
        A1["Looks clickable"]
        A2["Guess position"]
        A3["Wrong target"]
        A1 --> A2 --> A3
    end

    subgraph B["Beyond Coordinate Roulette"]
        B1["See entities"]
        B2["Affordances"]
        B3["Lease trust"]
        B4["Guard action"]
        B5["Semantic diff"]
        B1 --> B2 --> B3 --> B4 --> B5
    end

    A3 -. "move beyond this" .- B1

    classDef old fill:#fde2e2,stroke:#c0392b,color:#5c1f1f;
    classDef new fill:#e8f1ff,stroke:#3b6db3,color:#183257;
    class A1,A2,A3 old;
    class B1,B2,B3,B4,B5 new;

Core ideas

Four recurring design moves.

Provisional state

Do not keep observed state as timeless truth. Keep it as something that is probably true for now.

Leased trust

Do not trust a target forever. Trust it through a short-lived lease.

Guarded action

Before acting, check whether the assumptions behind the action are still valid.

Demand-driven perception

Do not pay for expensive perception on every step. Escalate only when the situation demands it.
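
The first three moves can be condensed into a loose TypeScript sketch. Every name here is illustrative, not the project's actual API: an observation is stored with an expiry (provisional state, leased trust), and the action only fires while the lease is live and every guard still passes (guarded action).

```typescript
// Provisional state: an observation plus the time window we still trust it for.
interface Lease<T> {
  value: T;
  expiresAt: number; // epoch ms; past this, the observation must be refreshed
}

function lease<T>(value: T, ttlMs: number, now: number = Date.now()): Lease<T> {
  return { value, expiresAt: now + ttlMs };
}

// Guarded action: fire only while the lease is live and every guard still holds.
function guardedAct<T>(
  l: Lease<T>,
  guards: Array<(v: T) => boolean>,
  act: (v: T) => void,
  now: number = Date.now(),
): "executed" | "blocked" {
  if (now >= l.expiresAt) return "blocked"; // stale: re-observe before acting
  if (!guards.every((g) => g(l.value))) return "blocked"; // assumption broken
  act(l.value);
  return "executed";
}
```

A real implementation would re-observe and retry on "blocked" rather than give up; the point of the contract is only that the action itself never runs on expired or guard-failing state.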

One concrete example

Reactive Perception Graph

One concrete expression of these ideas is the Reactive Perception Graph (RPG). RPG keeps external state provisional, tracks when that state becomes dirty or stale, and evaluates safety checks before an action is allowed to fire.

Screenshots are not truth.

flowchart LR
    subgraph A["Snapshot-and-Act"]
        A1["Observe screen"]
        A2["Think"]
        A3["Act on old assumption"]
        A1 --> A2 --> A3
        AX["World changed"] -.-> A3
    end

    subgraph B["Reactive Perception Graph"]
        B1["Observe target"]
        B2["Store provisional state"]
        B3["Dirty / stale signals"]
        B4["Validate lease"]
        B5["Run guards"]
        B6["Execute or block"]
        B1 --> B2 --> B3 --> B4 --> B5 --> B6
    end

    A3 --> C["Unsafe action"]
    B6 --> D["Safer action contract"]

    classDef bad fill:#fde2e2,stroke:#c0392b,color:#5c1f1f;
    classDef good fill:#e5f6ea,stroke:#2e8b57,color:#123d28;
    class C bad;
    class D good;
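
The execute-or-block tail of this flow can be sketched as a single node. Again, these are hypothetical names, not the real RPG implementation: observed state starts dirty, external signals mark it dirty again, and an action on a dirty or guard-failing node is blocked instead of executed.

```typescript
type Outcome = "executed" | "blocked";

class PerceptionNode<T> {
  private value: T | undefined;
  private dirty = true; // nothing observed yet counts as dirty

  observe(v: T): void {
    this.value = v;
    this.dirty = false;
  }

  // Called by cheap external signals: focus change, DOM mutation, window moved.
  invalidate(): void {
    this.dirty = true;
  }

  // Execute-or-block: the action never fires on dirty or guard-failing state.
  act(guard: (v: T) => boolean, run: (v: T) => void): Outcome {
    if (this.dirty || this.value === undefined) return "blocked"; // re-observe first
    if (!guard(this.value)) return "blocked"; // assumption no longer holds
    run(this.value);
    return "executed";
  }
}
```

Wiring invalidate() to event streams is also what makes perception demand-driven: the expensive re-observation happens only after a signal has marked the node dirty, not on every step.
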
Explore

Where to go next

Get in touch

Questions, bugs, or ideas? Start with GitHub.

I am not publishing a direct contact email on this site. If you found a bug, hit an integration problem, or want to discuss the direction of the project, the best path is GitHub.