The v1.9–v1.10 Milestone
The founding idea has always been "see entities, not coordinates." v1.9 and v1.10 carry it to the two surfaces that resisted it longest: the browser DOM, now targeted by meaning rather than by CSS selector, and apps with no accessible surface at all, where a single act now confirms its own result and points at what to do next.
graph LR
subgraph v140["v1.4: Delivery verified"]
G[Delivery verification]
end
subgraph v1518["v1.5–v1.8: Completion & reach"]
K[terminal exit code]
J[excel VBA bridge]
end
subgraph v19["v1.9: Target by meaning"]
N[browser by text / role / label]
O[modal-aware clicks]
end
subgraph v110["v1.10: Confirm in one call"]
P[roiCapture on visual-only]
Q[changed-region crop + preview]
end
v140 --> v1518 --> v19 --> v110
classDef stable fill:#e8f1ff,stroke:#3b6db3,color:#183257;
classDef fresh fill:#d7f2ed,stroke:#1f8a70,color:#0e3b30;
class G,K,J stable;
class N,O,P,Q fresh;
Two surfaces still addressed by location, not meaning.
By v1.8 the act-and-observe loop was trustworthy: a click reached a live target, a command's completion and exit code were known, and visual verification no longer raced. Two friction points remained, at opposite ends of the spectrum. On the browser, targeting was still positional: browser_click took a CSS selector, a brittle structural path that breaks the moment the markup shifts. And on a visual-only target (an Electron or PWA app, a game, a custom-drawn canvas, a Remote Desktop window), confirming an act meant leaving it: the act itself, then a separate desktop_state, then a screenshot, with a re-desktop_discover still to come before the next move. Three round-trips to confirm one click.
Semantic targeting reaches the browser.
v1.9 lets the browser tools name an element by what it means instead of where it sits in the markup:
browser_click({ by: "text", pattern: "Sign in" })
browser_fill({ by: "ariaLabel", pattern: "Search", value: "mcp" })
by accepts text, role, or ariaLabel; pattern is the value to match. The resolver gathers every candidate. If exactly one matches, it acts; if several match, it stops and reports the ambiguity rather than clicking a guess. The same release teaches the browser to tell a real modal dialog from an ordinary navigation drawer: browser_overview reports a machine-readable modal state, and browser_click refuses to click through a blocking dialog onto its backdrop. It is "see entities, not coordinates," finally applied to the surface that always had the entities.
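The "act on exactly one match, stop on several" rule can be sketched in a few lines. This is only an illustration, not the tool's actual resolver: the Candidate shape, the Resolution type, and the resolveTarget name are hypothetical, matching is assumed to be an exact string comparison, and the "none" outcome is an assumption the text above does not describe.

```typescript
// Hypothetical types for illustration; none of these names come from the real tools.
type By = "text" | "role" | "ariaLabel";

interface Candidate {
  id: string;
  text: string;
  role: string;
  ariaLabel: string;
}

type Resolution =
  | { kind: "unique"; target: Candidate }
  | { kind: "ambiguous"; matches: Candidate[] }
  | { kind: "none" }; // assumed outcome; the release notes only cover one vs. several

function resolveTarget(candidates: Candidate[], by: By, pattern: string): Resolution {
  // Assumption: exact, case-sensitive comparison. The real matcher may be looser.
  const matches = candidates.filter((c) => c[by] === pattern);
  const [first, ...rest] = matches;
  if (first === undefined) return { kind: "none" };
  if (rest.length === 0) return { kind: "unique", target: first };
  // Several candidates: report them all instead of picking one.
  return { kind: "ambiguous", matches };
}
```

A caller would act only on the "unique" branch and surface the "ambiguous" list to whoever chose the pattern, so a vague pattern fails loudly rather than clicking the wrong element.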
A visual-only act that confirms itself.
v1.10 turns to the other extreme — the targets with no structure to address. When you act on a visual-only target, a successful desktop_act can now bundle a roiCapture straight into its response:
desktop_act({ lease, action: "click", returnCapture: "on-change" })
// → { ok: true, diff: [ … ],
// roiCapture: { somImage, entities, roi, source } }
somImage: a base64 PNG cropped to just the region that changed, not the whole window, so you can see the result of your action.
entities: a lightweight, read-only preview of the labels and controls now visible in that region (the next targets, surfaced without a separate discover).
roi / source: where the crop came from.
That folds the old three round-trips — act, then desktop_state, then screenshot — into one call. A returnCapture option controls it: "on-change" (the default for visual-only targets) attaches the capture only when the screen actually changed; "always" attaches it on any successful act; "never" suppresses it. The preview entities carry no lease — they are a preview, not a handle — so you re-run desktop_discover to act on one, exactly as before.
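The three returnCapture modes reduce to a small decision table. The sketch below restates that table as code under stated assumptions: shouldAttachCapture, TargetKind, and ActOutcome are hypothetical names, and the rule that structured targets never receive a capture is taken from the scoping described elsewhere in these notes rather than from any published signature.

```typescript
// Hypothetical names; only the three mode strings come from the release notes.
type ReturnCapture = "on-change" | "always" | "never";
type TargetKind = "visual-only" | "structured";

interface ActOutcome {
  ok: boolean;      // the act succeeded
  changed: boolean; // the screen visibly changed as a result
}

function shouldAttachCapture(
  target: TargetKind,
  outcome: ActOutcome,
  mode: ReturnCapture = "on-change", // stated default for visual-only targets
): boolean {
  if (target === "structured") return false; // structured targets never get a roiCapture
  if (!outcome.ok) return false;             // only a successful act can carry one
  if (mode === "never") return false;
  if (mode === "always") return true;
  return outcome.changed;                    // "on-change"
}
```

Read this way, "never" restores the earlier payload exactly, and "always" only differs from the default when a successful act leaves the screen unchanged.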
Scoped to where it pays, and never a false alarm.
It is scoped to visual-only targets: on a structured target — a browser tab over the DevTools Protocol, an accessibility-rich native window — no roiCapture is attached, because there desktop_state and desktop_discover are cheaper and exact. Spending pixels there would be a regression, not a feature.
And the act's semantic diff stays correct even though the changed region is re-read visually: a stable on-screen label is never reported as having vanished, because the diff is judged against what was already discovered rather than re-read from a tight crop.
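One way to picture why a tight crop cannot produce a false "vanished" report is sketched below. It is a guess at the shape of the logic, not the actual implementation: Rect, Entity, contains, and vanishedLabels are hypothetical names, and the rule used here (a previously discovered label counts as vanished only if its bounds lie entirely inside the re-read region and the re-read no longer finds it) is an assumption made for illustration.

```typescript
// Hypothetical types; the real diff logic is not described at this level of detail.
interface Rect { x: number; y: number; width: number; height: number; }
interface Entity { label: string; bounds: Rect; }

// True when `inner` lies entirely within `outer`.
function contains(outer: Rect, inner: Rect): boolean {
  return (
    inner.x >= outer.x &&
    inner.y >= outer.y &&
    inner.x + inner.width <= outer.x + outer.width &&
    inner.y + inner.height <= outer.y + outer.height
  );
}

// Judge disappearance against the discovered baseline, not against the crop alone.
function vanishedLabels(baseline: Entity[], roi: Rect, seenInRoi: Entity[]): string[] {
  const seen = new Set(seenInRoi.map((e) => e.label));
  return baseline
    .filter((e) => contains(roi, e.bounds)) // only labels the crop could have re-read in full
    .filter((e) => !seen.has(e.label))      // and that the re-read did not find
    .map((e) => e.label);
}
```

Under this assumption, a stable label outside the changed region is never a candidate for removal, because the crop simply had no chance to see it.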
Additive, and scoped to where it pays.
The browser by/pattern targeting is new input on existing tools; the selector path is unchanged. roiCapture is additive — it only adds a field, never alters an existing one. On a visual-only target it is on by default for a visible change (returnCapture:"on-change"); pass returnCapture:"never" to suppress it and keep the prior payload, or "always" to force it. On structured targets it never appears, so those responses are unchanged — the heavier capture is computed only in the regime that has no cheaper truth.