The v1.4 Milestone
v1.0 narrowed the surface. v1.2 deepened each response. The v1.3 and v1.4 series go two steps further: they give the agent a memory of its past calls, and they close the gaps where a call could appear to succeed while doing nothing.
graph LR
subgraph v100["v1.0: Surface narrowed"]
A[28 unified tools]
end
subgraph v120["v1.2: Response deepened"]
B[as_of, confidence]
C[caused_by, based_on]
end
subgraph v130["v1.3: Memory added"]
D[working / episodic]
E[semantic / procedural]
F[per-call security tier]
end
subgraph v140["v1.4: Delivery verified"]
G[Typed error codes]
H[Partial-failure warnings]
I[Delivery verification]
end
v100 --> v120 --> v130 --> v140
classDef old fill:#fde2e2,stroke:#c0392b,color:#5c1f1f;
classDef stable fill:#e8f1ff,stroke:#3b6db3,color:#183257;
classDef fresh fill:#d7f2ed,stroke:#1f8a70,color:#0e3b30;
class A old;
class B,C stable;
class D,E,F,G,H,I fresh;
From a single observed call to a re-queryable history of calls.
v1.2 answered the question "what is true now, in reference to my last action?" in a single round-trip. It still left two questions open: what about the call before that, and did the call I just made actually do anything? The v1.3 and v1.4 series answer those two directly — one by adding a memory the agent can re-query, the other by closing the silent-failure paths in delivery.
Four queryable memory layers, all opt-in.
v1.3 turns the implicit "the agent has just made a tool call" into four explicit layers, each retrievable via the existing include argument:
desktop_state({ include: ["envelope", "causal", "working:5"] })
working:N
The most recent N tool calls in compact form (default 10, max 50), including in-flight calls. The "what did I just do?" layer.
episodic:N
Completed calls in rich form: lease summary, event ids, elapsed time (default 5, max 100). The "show me how a similar successful call was timed" layer.
semantic:K
Successful interaction patterns mined per window title (default 3, max 10). Conservative by design — only fully successful sequences become patterns.
procedural:K
Repeated run_macro outcomes that have succeeded ≥3 times, never failed, and contain no destructive tools. Auto-suggesting destructive workflows is intentionally out of scope.
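The keyword grammar above (a layer name, optionally followed by a count) can be sketched as a small client-side parser. This is a minimal illustration, not the tool's own implementation: `parseMemoryKeyword`, `LIMITS`, and `MemoryRequest` are hypothetical names, and clamping an over-limit count (rather than rejecting it) is an assumption. No default or maximum is stated for procedural, so the sketch leaves its count unset.

```typescript
// Hypothetical client-side helper; none of these names come from the tool's API.
type MemoryLayer = "working" | "episodic" | "semantic" | "procedural";

interface LayerLimits {
  defaultCount: number;
  maxCount: number;
}

// Defaults and maxima as stated in the text. No limits are stated for
// `procedural`, so it is deliberately absent from this table.
const LIMITS: Partial<Record<MemoryLayer, LayerLimits>> = {
  working: { defaultCount: 10, maxCount: 50 },
  episodic: { defaultCount: 5, maxCount: 100 },
  semantic: { defaultCount: 3, maxCount: 10 },
};

const LAYERS: readonly MemoryLayer[] = ["working", "episodic", "semantic", "procedural"];

interface MemoryRequest {
  layer: MemoryLayer;
  count: number | undefined; // undefined: leave the choice to the server
}

// Parse "layer" or "layer:N". Returns null for include keywords that are not
// memory layers (for example "envelope" or "causal").
function parseMemoryKeyword(keyword: string): MemoryRequest | null {
  const parts = keyword.split(":");
  const name = parts[0];
  const rawCount: string | undefined = parts[1];

  const layer = LAYERS.find((l) => l === name);
  if (layer === undefined) return null;

  const limits = LIMITS[layer];
  const requested = rawCount === undefined ? NaN : Number.parseInt(rawCount, 10);
  if (Number.isNaN(requested)) {
    // No usable count given: fall back to the documented default, if any.
    return { layer, count: limits?.defaultCount };
  }

  // Assumption: out-of-range requests are clamped rather than rejected.
  const upper = limits?.maxCount ?? requested;
  return { layer, count: Math.min(Math.max(requested, 1), upper) };
}
```

Under these assumptions, `parseMemoryKeyword("working:5")` yields a count of 5, a bare `"episodic"` falls back to 5, and `"semantic:99"` is clamped to 10.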
A per-call security tier (memory_strict / memory_balanced / memory_open) adds one more axis. The rule is one-directional: the operator's environment sets a ceiling on openness, and the LLM can ask for a stricter tier than that ceiling, but never a looser one.
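The one-directional rule can be illustrated as a comparison of strictness ranks. This is a sketch under stated assumptions: the ordering (open, then balanced, then strict) is inferred from the tier names, `effectiveTier` is a hypothetical helper, and silently ignoring a looser request (rather than returning an error) is a guess about behavior the text does not specify.

```typescript
type SecurityTier = "memory_open" | "memory_balanced" | "memory_strict";

// Assumption: strictness increases in this order, as the tier names suggest.
const STRICTNESS: Record<SecurityTier, number> = {
  memory_open: 0,
  memory_balanced: 1,
  memory_strict: 2,
};

// The operator's environment sets the baseline. A per-call request is honored
// only when it is at least as strict as that baseline; a looser request falls
// back to the operator's tier.
function effectiveTier(operatorTier: SecurityTier, requested?: SecurityTier): SecurityTier {
  if (requested === undefined) return operatorTier;
  return STRICTNESS[requested] >= STRICTNESS[operatorTier] ? requested : operatorTier;
}
```

With an operator setting of memory_balanced, a request for memory_strict is honored, while a request for memory_open resolves to memory_balanced.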
"Did my call actually do anything?" — answered in the response.
v1.4 closes the silent-failure gaps where a call could return ok:true without observable effect:
browser_click installs a MutationObserver before the click and inspects DOM mutations, URL changes, and document.activeElement shifts over a 500 ms window. SPA buttons with no listener attached used to silently absorb the click; they now report unverifiable with a typed reason.
browser_fill reads the input's value back after dispatch. Mismatches surface as BrowserFillNotDelivered, with controlled-input transforms distinguished from outright rejection.
terminal({action:'send', method:'background'}) auto-detects hidden-input prompts (such as password:, sudo, Read-Host -AsSecureString) and reports unverifiable instead of a false BackgroundInputNotDelivered.
keyboard({action:'type', method:'background'}) to Windows 11 New Notepad now reports delivered via a ValuePattern fallback when TextPattern is unavailable.
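The decision that browser_click makes once the observation window closes can be sketched as a pure function over the three signals named above. This is an illustration only: `ClickSignals`, `classifyClick`, and the reason string `"no_observable_effect"` are hypothetical, and using delivered as the positive status for clicks is an assumption borrowed from the keyboard case.

```typescript
// Signals gathered during the observation window (500 ms in the text).
interface ClickSignals {
  domMutations: number;          // records seen by the MutationObserver
  urlChanged: boolean;           // location differs from its pre-click value
  activeElementChanged: boolean; // document.activeElement moved
}

type ClickDelivery =
  | { status: "delivered" }
  | { status: "unverifiable"; reason: string };

// Any single observable effect counts as evidence that the click landed.
// With no evidence at all, the result is "unverifiable" rather than a
// false success.
function classifyClick(signals: ClickSignals): ClickDelivery {
  const observed =
    signals.domMutations > 0 || signals.urlChanged || signals.activeElementChanged;
  return observed
    ? { status: "delivered" }
    : { status: "unverifiable", reason: "no_observable_effect" }; // hypothetical reason code
}
```

The design point is that absence of evidence maps to unverifiable, not to failure: a click on a button with no listener may be perfectly well delivered and still change nothing.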
Refusals carry their own enum.
Auto-guard refusals (ambiguous target, target not found, blocked by modal, browser not ready, identity changed, …) used to surface as a generic ToolError. Callers had to parse the message string to decide what to do next. v1.4 introduces code: "AutoGuardBlocked" with a status enum and matching recovery hints, so agents can branch on the typed code directly.
Spawn-time refusals from workspace_launch and browser_open get the same treatment: code: "SpawnFailed", with hints covering the common ENOENT / EACCES / EPERM cases.
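Branching on the typed code, instead of parsing the message, might look like the sketch below. The two code values come from the text; everything else is assumed for illustration: the status spellings (such as `"blocked_by_modal"`), the `ToolFailure` shape, and the recovery step names are hypothetical, since the actual enum values are not listed here.

```typescript
// Hypothetical failure shape; only the two `code` values are taken from the text.
interface ToolFailure {
  code?: string;
  status?: string;
  message: string;
}

type NextStep =
  | "disambiguate"
  | "dismiss_modal"
  | "wait_and_retry"
  | "re_resolve_target"
  | "check_launch_command"
  | "report";

// Hypothetical status spellings mapped to a recovery step.
const AUTO_GUARD_RECOVERY = new Map<string, NextStep>([
  ["ambiguous_target", "disambiguate"],
  ["blocked_by_modal", "dismiss_modal"],
  ["browser_not_ready", "wait_and_retry"],
  ["target_not_found", "re_resolve_target"],
  ["identity_changed", "re_resolve_target"],
]);

// Decide what to do next from the typed code alone; the message is never parsed.
function planRecovery(failure: ToolFailure): NextStep {
  if (failure.code === "AutoGuardBlocked") {
    return AUTO_GUARD_RECOVERY.get(failure.status ?? "") ?? "report";
  }
  if (failure.code === "SpawnFailed") {
    return "check_launch_command";
  }
  return "report"; // untyped or unknown failures are surfaced as-is
}
```

Unknown statuses fall through to `"report"`, so an agent written against this sketch degrades gracefully if the enum grows.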
run_macro with stop_on_error: false now reports nested step failures via a top-level warnings[] array — no more JSON-parsing each step to find them.
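Reading the top-level array could be as small as the following sketch. The text only says that a warnings[] array exists at the top level; the `MacroWarning` fields (`step`, `message`) and the `MacroResult` shape are assumptions made for illustration.

```typescript
// Hypothetical shapes: the text confirms only a top-level `warnings[]`.
interface MacroWarning {
  step: number;    // assumed: index of the nested step that failed
  message: string; // assumed: human-readable failure description
}

interface MacroResult {
  ok: boolean;
  warnings?: MacroWarning[];
}

// Collect the indices of failed steps without inspecting per-step payloads.
// A response with no `warnings` field is treated as having no failures.
function failedSteps(result: MacroResult): number[] {
  return (result.warnings ?? []).map((w) => w.step);
}
```

Treating a missing array as empty keeps the helper safe against responses from versions that predate the field.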
method: 'foreground_flash' — opt-in, additive, explicit.
method: 'background' had been refused on Windows Terminal since v1.3.2 because the WinUI/XAML pipeline silently drops WM_CHAR. v1.4.2 introduces an opt-in alternative: method: 'foreground_flash' briefly steals foreground, pastes the text via the clipboard, and restores the previous foreground in roughly 50–200 ms.
The clipboard is saved and restored across the operation. The response carries explicit hints about the typing-leak risk during the flash window. The default method: 'background' continues to refuse Windows Terminal targets — the silent-fail guarantee from v1.3.2 is unchanged.
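The save, flash, and restore sequence described above can be sketched against an injected interface. This is not the tool's implementation: `DesktopOps` and `foregroundFlashPaste` are hypothetical, the real operations would be asynchronous platform calls, and the sketch is synchronous only to keep the ordering easy to read.

```typescript
// Hypothetical abstraction over the platform calls involved.
interface DesktopOps {
  getForeground(): string;
  setForeground(windowId: string): void;
  readClipboard(): string;
  writeClipboard(text: string): void;
  paste(): void; // sends a paste to whichever window is in the foreground
}

// Save state, flash the target to the foreground, paste, then restore.
// The `finally` block runs even if the paste throws, so the clipboard and
// the previous foreground window are restored on every path.
function foregroundFlashPaste(ops: DesktopOps, target: string, text: string): void {
  const previousForeground = ops.getForeground();
  const savedClipboard = ops.readClipboard();
  try {
    ops.writeClipboard(text);
    ops.setForeground(target);
    ops.paste();
  } finally {
    ops.writeClipboard(savedClipboard);
    ops.setForeground(previousForeground);
  }
}
```

Anything the user types between `setForeground(target)` and the restore lands in the target window, which is the typing-leak risk the response hints warn about.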
Existing callers stay untouched.
If you do not pass any of the new include keywords, nothing changes. Memory persistence and redaction default off. Delivery verification surfaces as opt-in hints.verifyDelivery on existing responses, and the new typed codes preserve the previous error-message prefix so legacy string matching still works. The new capabilities are paid for only by callers who ask for them.