Project Evolution

The v1.4 Milestone

v1.0 narrowed the surface. v1.2 deepened each response. The v1.3 and v1.4 series go two more steps: they give the agent a memory of its past calls, and they close the gaps where a call could appear to succeed while doing nothing.

graph LR
    subgraph v100["v1.0: Surface narrowed"]
        A[28 unified tools]
    end

    subgraph v120["v1.2: Response deepened"]
        B[as_of, confidence]
        C[caused_by, based_on]
    end

    subgraph v130["v1.3: Memory added"]
        D[working / episodic]
        E[semantic / procedural]
        F[per-call security tier]
    end

    subgraph v140["v1.4: Delivery verified"]
        G[Typed error codes]
        H[Partial-failure warnings]
        I[Delivery verification]
    end

    v100 --> v120 --> v130 --> v140

    classDef old fill:#fde2e2,stroke:#c0392b,color:#5c1f1f;
    classDef stable fill:#e8f1ff,stroke:#3b6db3,color:#183257;
    classDef fresh fill:#d7f2ed,stroke:#1f8a70,color:#0e3b30;
    class A old;
    class B,C stable;
    class D,E,F,G,H,I fresh;

The shift

From a single observed call to a re-queryable history of calls.

v1.2 answered the question "what is true now, in reference to my last action?" in a single round-trip. It still left two questions open: what about the call before that, and did the call I just made actually do anything? The v1.3 and v1.4 series answer those two directly — one by adding a memory the agent can re-query, the other by closing the silent-failure paths in delivery.

Key Evolution: Cognitive memory (v1.3)

Four queryable memory layers, all opt-in.

v1.3 turns the implicit "the agent has just made a tool call" into four explicit layers, each retrievable via the existing include argument:

desktop_state({ include: ["envelope", "causal", "working:5"] })

`working:N`

The most recent N tool calls in compact form (default 10, max 50), including in-flight calls. The "what did I just do?" layer.

`episodic:N`

Completed calls in rich form — lease summary, event ids, elapsed time (default 5, max 100). The "show me how the similar successful call was timed" layer.

`semantic:K`

Successful interaction patterns mined per window title (default 3, max 10). Conservative by design — only fully successful sequences become patterns.

`procedural:K`

Repeated run_macro outcomes that have succeeded ≥3 times, never failed, and contain no destructive tools. Auto-suggesting destructive workflows is intentionally out of scope.

A per-call security tier (memory_strict / memory_balanced / memory_open) adds one more axis. The rule is one-directional: the LLM can ask for more security than the operator's environment ceiling, but never less.

Key Evolution: Delivery verification (v1.4)

"Did my call actually do anything?" — answered in the response.

v1.4 closes the silent-failure gaps where a call could return ok:true without observable effect:

browser_click installs a MutationObserver before the click and inspects DOM mutations, URL changes, and document.activeElement shifts over a 500 ms window. SPA buttons with no listener attached used to silently absorb the click; they now report unverifiable with a typed reason.
browser_fill reads the input's value back after dispatch. Mismatches surface as BrowserFillNotDelivered, with controlled-input transforms distinguished from outright rejection.
terminal({action:'send', method:'background'}) auto-detects hidden-input prompts (such as password:, sudo, Read-Host -AsSecureString) and reports unverifiable instead of a false BackgroundInputNotDelivered.
keyboard({action:'type', method:'background'}) to Windows 11 New Notepad now reports delivered via a ValuePattern fallback when TextPattern is unavailable.

Key Evolution: Typed error codes

Refusals carry their own enum.

Auto-guard refusals (ambiguous target, target not found, blocked by modal, browser not ready, identity changed, …) used to surface as a generic ToolError. Callers had to parse the message string to decide what to do next. v1.4 introduces code: "AutoGuardBlocked" with a status enum and matching recovery hints, so agents can branch on the typed code directly.

Spawn-time refusals from workspace_launch and browser_open get the same treatment: code: "SpawnFailed", with hints covering the common ENOENT / EACCES / EPERM cases.

run_macro with stop_on_error: false now reports nested step failures via a top-level warnings[] array — no more JSON-parsing each step to find them.

Key Evolution: A new channel for Windows Terminal

`method: 'foreground_flash'` — opt-in, additive, explicit.

method: 'background' had been refused on Windows Terminal since v1.3.2 because the WinUI/XAML pipeline silently drops WM_CHAR. v1.4.2 introduces an opt-in alternative: method: 'foreground_flash' briefly steals foreground, pastes the text via the clipboard, and restores the previous foreground in roughly 50–200 ms.

The clipboard is saved and restored across the operation. The response carries explicit hints about the typing-leak risk during the flash window. The default method: 'background' continues to refuse Windows Terminal targets — the silent-fail guarantee from v1.3.2 is unchanged.

Compatibility

Existing callers stay untouched.

If you do not pass any of the new include keywords, nothing changes. Memory persistence and redaction default off. Delivery verification surfaces as opt-in hints.verifyDelivery on existing responses, and the new typed codes preserve the previous error-message prefix so legacy string matching still works. The new capabilities are paid for only by callers who ask for them.

View Full Changelog Read the v1.2 Milestone

The v1.4 Milestone

From a single observed call to a re-queryable history of calls.

Four queryable memory layers, all opt-in.

working:N

episodic:N

semantic:K

procedural:K