Project Evolution

The v1.9–v1.10 Milestone

The founding idea has always been "see entities, not coordinates." v1.9 and v1.10 carry it to the two surfaces that resisted it longest: the browser DOM, now targeted by meaning rather than by CSS selector, and the apps with no accessible surface at all, where a single act now confirms its own result and points at what to do next.

graph LR
    subgraph v140["v1.4: Delivery verified"]
        G[Delivery verification]
    end

    subgraph v1518["v1.5–v1.8: Completion & reach"]
        K[terminal exit code]
        J[excel VBA bridge]
    end

    subgraph v19["v1.9: Target by meaning"]
        N[browser by text / role / label]
        O[modal-aware clicks]
    end

    subgraph v110["v1.10: Confirm in one call"]
        P[roiCapture on visual-only]
        Q[changed-region crop + preview]
    end

    v140 --> v1518 --> v19 --> v110

    classDef stable fill:#e8f1ff,stroke:#3b6db3,color:#183257;
    classDef fresh fill:#d7f2ed,stroke:#1f8a70,color:#0e3b30;
    class G,K,J stable;
    class N,O,P,Q fresh;

The shift

Two surfaces still addressed by location, not meaning.

By v1.8 the act-and-observe loop was trustworthy: a click reached a live target, a command's completion and exit code were known, and visual verification no longer raced. Two friction points remained, at opposite ends of the spectrum. On the browser, targeting was still positional: browser_click took a CSS selector, a brittle structural path that breaks the moment the markup shifts. And on a visual-only target (an Electron or PWA app, a game, a custom-drawn canvas, a Remote Desktop window), confirming an act meant leaving it: a separate desktop_state and screenshot, then a fresh desktop_discover. Three round-trips to confirm one click.

Key Evolution: Target by meaning (v1.9)

Semantic targeting reaches the browser.

v1.9 lets the browser tools name an element by what it means instead of where it sits in the markup:

browser_click({ by: "text", pattern: "Sign in" })
browser_fill({ by: "ariaLabel", pattern: "Search", value: "mcp" })

by accepts text, role, or ariaLabel; pattern is what to match. The resolver gathers every candidate, and if exactly one matches it acts — if several match, it stops and reports the ambiguity rather than clicking a guess. The same release teaches the browser to tell a real modal dialog from an ordinary navigation drawer: browser_overview reports a machine-readable modal state, and browser_click refuses to click through a blocking dialog onto its backdrop. It is "see entities, not coordinates," finally applied to the surface that always had the entities.
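The act-only-on-a-unique-match rule can be sketched roughly as follows. This is a minimal illustration, not the project's resolver: the Candidate and Resolution types, the resolve function, and the case-insensitive substring matching are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real resolver's types are not shown in this document.
type By = "text" | "role" | "ariaLabel";

interface Candidate {
  id: string;
  text: string;
  role: string;
  ariaLabel: string;
}

type Resolution =
  | { kind: "unique"; target: Candidate }
  | { kind: "ambiguous"; matches: Candidate[] }
  | { kind: "none" };

// Gather every candidate whose chosen field matches, then act only on a unique hit.
// Assumption: matching is a case-insensitive substring test; the actual rule may differ.
function resolve(candidates: Candidate[], by: By, pattern: string): Resolution {
  const needle = pattern.toLowerCase();
  const matches = candidates.filter(
    (c) => c[by].toLowerCase().indexOf(needle) !== -1
  );
  if (matches.length === 1) {
    return { kind: "unique", target: matches[0] };
  }
  if (matches.length > 1) {
    // Several matches: report them all instead of clicking a guess.
    return { kind: "ambiguous", matches };
  }
  return { kind: "none" };
}
```

Returning the ambiguity as a distinct result, rather than picking the first match, lets the caller narrow the pattern or switch to a different by field before retrying.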

Key Evolution: Confirm in one call (v1.10)

A visual-only act that confirms itself.

v1.10 turns to the other extreme — the targets with no structure to address. When you act on a visual-only target, a successful desktop_act can now bundle a roiCapture straight into its response:

desktop_act({ lease, action: "click", returnCapture: "on-change" })
// → { ok: true, diff: [ … ],
//     roiCapture: { somImage, entities, roi, source } }
  • somImage — a base64 PNG cropped to just the region that changed, not the whole window, so you can see the result of your action.
  • entities — a lightweight, read-only preview of the labels and controls now visible in that region: the next targets, surfaced without a separate discover.
  • roi / source — where the crop came from.

That folds the old three round-trips — act, then desktop_state, then screenshot — into one call. A returnCapture option controls it: "on-change" (the default for visual-only targets) attaches the capture only when the screen actually changed; "always" attaches it on any successful act; "never" suppresses it. The preview entities carry no lease — they are a preview, not a handle — so you re-run desktop_discover to act on one, exactly as before.
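The three returnCapture modes amount to a small decision rule, which might look something like this. The ActOutcome shape and the shouldAttachCapture helper are hypothetical names for this sketch; only the mode strings and their described behavior come from the text above.

```typescript
type ReturnCapture = "on-change" | "always" | "never";

// Hypothetical summary of one act; these field names are assumptions for the sketch.
interface ActOutcome {
  ok: boolean;            // the act succeeded
  visualOnly: boolean;    // the target has no structured surface
  screenChanged: boolean; // a visible change was detected after the act
}

// Decide whether a roiCapture would be attached to the response.
function shouldAttachCapture(
  outcome: ActOutcome,
  mode: ReturnCapture = "on-change" // described as the default for visual-only targets
): boolean {
  // Only successful acts on visual-only targets are eligible at all.
  if (!outcome.ok || !outcome.visualOnly) return false;
  if (mode === "never") return false;
  if (mode === "always") return true;
  // "on-change": attach only when the screen actually changed.
  return outcome.screenChanged;
}
```

Under this reading, "always" still requires a successful act, and no mode can force a capture onto a structured target.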

What keeps it honest

Scoped to where it pays, and never a false alarm.

It is scoped to visual-only targets: on a structured target — a browser tab over the DevTools Protocol, an accessibility-rich native window — no roiCapture is attached, because there desktop_state and desktop_discover are cheaper and exact. Spending pixels there would be a regression, not a feature.

And the act's semantic diff stays correct even though the changed region is re-read visually: a stable on-screen label is never reported as having vanished, because the diff is judged against what was already discovered rather than re-read from a tight crop.
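One possible reading of that guarantee, sketched under stated assumptions: a previously discovered label can only count as vanished if it sat inside the re-read region and the re-read no longer finds it. The Box and Label types, the vanishedLabels helper, and the full-containment test are all hypothetical and are not taken from the project's code.

```typescript
// Hypothetical geometry and entity shapes for illustration only.
interface Box { x: number; y: number; w: number; h: number }
interface Label { text: string; box: Box }

// True when `inner` lies entirely within `outer`.
function contains(outer: Box, inner: Box): boolean {
  return (
    inner.x >= outer.x &&
    inner.y >= outer.y &&
    inner.x + inner.w <= outer.x + outer.w &&
    inner.y + inner.h <= outer.y + outer.h
  );
}

// Judge "vanished" against the discovered baseline, not against the crop alone:
// labels outside the changed region are never candidates for removal.
function vanishedLabels(discovered: Label[], roi: Box, rereadInRoi: Label[]): Label[] {
  return discovered.filter(
    (label) =>
      contains(roi, label.box) &&
      !rereadInRoi.some((seen) => seen.text === label.text)
  );
}
```

Diffing the crop's entities directly against the full baseline would instead flag every label outside the crop as missing, which is the false alarm the paragraph above rules out.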

Compatibility

Additive, and scoped to where it pays.

The browser by/pattern targeting is new input on existing tools; the selector path is unchanged. roiCapture is additive — it only adds a field, never alters an existing one. On a visual-only target it is on by default for a visible change (returnCapture:"on-change"); pass returnCapture:"never" to suppress it and keep the prior payload, or "always" to force it. On structured targets it never appears, so those responses are unchanged — the heavier capture is computed only in the regime that has no cheaper truth.