Why screenshots are not enough

Reactive Perception Graph

By the time an LLM acts, the world may already be different. Reactive Perception Graph is one way to deal with that.

flowchart LR
    subgraph A["Snapshot-and-Act"]
        A1["Observe screen"]
        A2["Think"]
        A3["Act on old assumption"]
        A1 --> A2 --> A3
        AX["World changed"] -.-> A3
    end

    subgraph B["Reactive Perception Graph"]
        B1["Observe target"]
        B2["Store provisional state"]
        B3["Dirty / stale signals"]
        B4["Validate lease"]
        B5["Run guards"]
        B6["Execute or block"]
        B1 --> B2 --> B3 --> B4 --> B5 --> B6
    end

    A3 --> C["Unsafe action"]
    B6 --> D["Safer action contract"]

    classDef bad fill:#fde2e2,stroke:#c0392b,color:#5c1f1f;
    classDef good fill:#e5f6ea,stroke:#2e8b57,color:#123d28;
    class C bad;
    class D good;

The problem

Seeing and touching are separated by time.

Many LLM agents implicitly follow a loop like this: observe the interface, think, then act. That sounds harmless. In a dynamic interface, it is fragile.

What can change?

the user focuses another window
a modal appears
the UI re-renders
the target moves or disappears

What breaks?

The model still acts on the assumptions formed at observation time, even though those assumptions are no longer valid at action time.

Tiny accident story

Correct intent, wrong target.

Suppose an LLM wants to type hello into Notepad. It observes Notepad and decides where to type. Before it acts, another window comes to the front, and it sends hello anyway.

The agent may execute the intended action correctly, but on the wrong target.

That is not mainly an intelligence failure. It is a stale-assumption failure.

Core claim

External state should be treated as provisional.

Reactive Perception Graph (RPG) is a layer that treats external state as provisional and re-checks the assumptions behind an action before the action fires.

Important clarification

RPG is not a screenshot cache. It is a different contract between the agent and the world.

Four ideas

The parts that make the contract work.

flowchart TB
    L["Lens\nWhat am I watching?"]
    P["Provisional state\nWhat do I currently believe?"]
    G["Guard\nIs this action still safe?"]
    T["Lease\nCan I still trust this target?"]
    X["Action"]

    L --> P
    P --> G
    P --> T
    T --> G
    G --> X

    W["World changes"] -. "marks dirty" .-> P
    W -. "can revoke trust" .-> T

    classDef core fill:#eef4ff,stroke:#3b6db3,color:#183257;
    classDef edge fill:#fff6df,stroke:#b8860b,color:#5a4300;
    class L,P,G,T core;
    class X edge;

Provisional state

Keep not only what the agent believes, but how trustworthy that belief still is.
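
As a minimal sketch of this idea, a belief can be stored together with its observation time and a freshness flag. The names `Provisional`, `observeValue`, `markDirty`, and `freshnessAt`, and the notion of a maximum age, are illustrative assumptions rather than part of any published RPG interface.

```typescript
// Hypothetical shape: a belief plus the metadata needed to judge it.
type Freshness = "fresh" | "dirty" | "stale";

interface Provisional<T> {
  value: T;             // what the agent believes
  observedAtMs: number; // when the belief was formed
  freshness: Freshness; // how trustworthy it still is
}

function observeValue<T>(value: T, nowMs: number): Provisional<T> {
  return { value, observedAtMs: nowMs, freshness: "fresh" };
}

// A world-change signal downgrades the belief without discarding it.
function markDirty<T>(belief: Provisional<T>): Provisional<T> {
  return { ...belief, freshness: "dirty" };
}

// A belief that was never marked dirty still goes stale after maxAgeMs (assumed policy).
function freshnessAt<T>(belief: Provisional<T>, nowMs: number, maxAgeMs: number): Freshness {
  if (belief.freshness !== "fresh") return belief.freshness;
  return nowMs - belief.observedAtMs > maxAgeMs ? "stale" : "fresh";
}
```

Passing the clock in as `nowMs` keeps the sketch deterministic; a real layer would likely read a monotonic clock instead.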

Lens

A watchpoint on something the agent currently cares about.
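
One way to picture a lens is as a named probe that remembers its last reading and reports when a new reading differs. The sketch below assumes exactly that; `Lens`, `createLens`, and `pollLens` are hypothetical names, and the probe is a plain callback rather than a real OS query.

```typescript
// Hypothetical lens: a watchpoint that reports when its reading changes.
interface Lens<T> {
  label: string;
  read: () => T;       // cheap probe of the external world
  last: T | undefined; // most recent reading, if any
}

function createLens<T>(label: string, read: () => T): Lens<T> {
  return { label, read, last: undefined };
}

// Re-reads the world. Returns true when the reading differs from the previous one.
// The first poll only records a baseline and therefore returns false.
function pollLens<T>(lens: Lens<T>): boolean {
  const current = lens.read();
  const changed = lens.last !== undefined && current !== lens.last;
  lens.last = current;
  return changed;
}
```

A change reported by `pollLens` is the kind of signal that could mark the related provisional state as dirty.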

Guard

A safety check before action when the environment may have drifted.
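
A guard can be sketched as a predicate that compares what the agent expected with what is live right now. The example assumes a tiny `WorldView` record and two guards (`sameForeground`, `noModal`); all of these names are hypothetical. A guard that throws is treated as a failure, so the check fails closed.

```typescript
// Hypothetical view of the world, reduced to two facts for the example.
interface WorldView { foregroundId: number; modalOpen: boolean; }
interface GuardResult { ok: boolean; reason?: string; }
type Guard = (expected: WorldView, live: WorldView) => GuardResult;

const sameForeground: Guard = (expected, live) =>
  expected.foregroundId === live.foregroundId
    ? { ok: true }
    : { ok: false, reason: "foreground changed" };

const noModal: Guard = (_expected, live) =>
  live.modalOpen ? { ok: false, reason: "modal open" } : { ok: true };

// Runs guards in order and stops at the first failure.
// A guard that throws counts as a failure (fail closed).
function runGuards(guards: Guard[], expected: WorldView, live: WorldView): GuardResult {
  for (const guard of guards) {
    try {
      const result = guard(expected, live);
      if (!result.ok) return result;
    } catch (_err) {
      return { ok: false, reason: "guard threw" };
    }
  }
  return { ok: true };
}
```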

Lease

A temporary trust contract for an external target.
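
A lease can be sketched as trust that expires in two ways: by time, and by the target changing underneath it. The version below assumes a time-to-live plus a generation counter that the observation layer bumps whenever the target changes. The field names and this `issueLease` signature are illustrative assumptions.

```typescript
// Hypothetical lease: trust in a target, bounded by time and by target version.
interface Lease {
  targetId: string;
  generation: number;  // version of the target when the lease was issued
  expiresAtMs: number; // absolute expiry time
}

function issueLease(targetId: string, generation: number, nowMs: number, ttlMs: number): Lease {
  return { targetId, generation, expiresAtMs: nowMs + ttlMs };
}

// Valid only if the lease has not expired AND the target has not changed since issue.
function isLeaseValid(lease: Lease, liveGeneration: number, nowMs: number): boolean {
  return nowMs < lease.expiresAtMs && lease.generation === liveGeneration;
}
```

The generation check is what lets a world change revoke trust immediately instead of waiting for the timeout.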

Before action

Execute should be the final step, not the default step.

flowchart TD
    S["See target"] --> P["Issue lease"]
    P --> Q["Keep state as provisional"]
    Q --> R{"World changed?"}
    R -- "No" --> U["Action proposed"]
    R -- "Yes" --> T["Mark dirty / stale"]
    T --> U
    U --> V{"Lease valid?"}
    V -- "No" --> W["Refresh view"]
    V -- "Yes" --> X{"Guards pass?"}
    X -- "No" --> Y["Block or recover"]
    X -- "Yes" --> Z["Execute action"]

    classDef safe fill:#e7f8ec,stroke:#2e8b57,color:#173d29;
    classDef risk fill:#fff4d6,stroke:#b8860b,color:#5a4300;
    classDef stop fill:#fde8e8,stroke:#c0392b,color:#5c1f1f;
    class Z safe;
    class T,W risk;
    class Y stop;

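// Sketch of the pre-action sequence. The helper functions are not defined here.
function act(target) {
  const lease = issueLease(target);
  const state = rememberAsProvisional(target);

  // Lease no longer valid: re-observe instead of acting.
  if (!validateLease(lease, state)) {
    return refresh();
  }

  // Any failing guard blocks the action.
  if (!guardsPass(state)) {
    return block();
  }

  // Only reached when both checks pass.
  return execute();
}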
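
The sequence above leaves its helpers undefined. As a self-contained illustration of the same ordering (lease first, then guards, then execution), the sketch below passes the checks in as callbacks and returns which branch was taken. `ActionContext`, `Outcome`, and `attempt` are hypothetical names.

```typescript
type Outcome = "executed" | "blocked" | "refresh";

// Hypothetical bundle of everything needed to decide about one action.
interface ActionContext {
  leaseValid: () => boolean;
  guards: Array<() => boolean>;
  execute: () => void;
}

function attempt(ctx: ActionContext): Outcome {
  // An invalid lease means the view is out of date: re-observe, do not act.
  if (!ctx.leaseValid()) return "refresh";
  // every() stops at the first guard that returns false.
  if (!ctx.guards.every((guard) => guard())) return "blocked";
  // Execution is the last step and only runs when both checks passed.
  ctx.execute();
  return "executed";
}
```

Returning an explicit outcome instead of throwing lets the caller decide whether to refresh, recover, or give up.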

Beyond the desktop

This is a broader contract problem.

Browser agents

A DOM observed earlier may no longer match the live page.

Workflow or API agents

A previously fetched resource handle may no longer be valid.
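
For the workflow case, one familiar way to express the same contract is a version check at write time: the update only applies if the resource still has the version the agent saw. The in-memory `ResourceStore` below is a hypothetical stand-in for a real API, used only to show the check.

```typescript
interface Versioned<T> { value: T; version: number; }

// Hypothetical in-memory store; a real system would be a remote API.
class ResourceStore<T> {
  private items: Record<string, Versioned<T>> = {};

  // Writes a value and bumps the version (first write gets version 1).
  put(key: string, value: T): Versioned<T> {
    const prev = this.items[key];
    const next: Versioned<T> = { value, version: prev ? prev.version + 1 : 1 };
    this.items[key] = next;
    return next;
  }

  // Applies the update only if the caller's version matches the live one.
  updateIfUnchanged(key: string, seenVersion: number, value: T): boolean {
    const live = this.items[key];
    if (!live || live.version !== seenVersion) return false;
    this.put(key, value);
    return true;
  }
}
```

A rejected update plays the same role as a failed lease: the agent should re-fetch before trying again.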

Embodied agents

An object seen a moment ago may no longer be where the agent assumes it is.

Initial MVP

Proving the reflex arc.

Before building the full "nervous system," we spent considerable time on trial and error with a Minimum Viable Product (MVP). The goal was to prove the reflex arc, the immediate low-level loop that protects an action, without the overhead of a complex graph.

The MVP Scope

Cheap Fluents: High-signal, low-cost facts (window presence, foreground status, rect stability).
Basic Guards: "Fail-closed" predicates for identity and coordinate validity.
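
The MVP scope can be sketched as a single fail-closed check over the three cheap fluents. The example assumes the fluents have already been read into a plain `WindowFluents` record. It makes no real Win32 calls, and the names and the 2-pixel default tolerance are illustrative assumptions.

```typescript
interface WinRect { left: number; top: number; right: number; bottom: number; }

// Hypothetical record of the cheap fluents for one window.
interface WindowFluents { exists: boolean; isForeground: boolean; rect: WinRect; }

// Rect stability: every edge moved by at most tolerancePx.
function rectStable(a: WinRect, b: WinRect, tolerancePx: number): boolean {
  const edges: Array<keyof WinRect> = ["left", "top", "right", "bottom"];
  return edges.every((edge) => Math.abs(a[edge] - b[edge]) <= tolerancePx);
}

// Coordinate validity: the click point lies inside the live rect.
function pointInside(rect: WinRect, x: number, y: number): boolean {
  return x >= rect.left && x < rect.right && y >= rect.top && y < rect.bottom;
}

// live === null models a failed probe; the check then fails closed.
function safeToClick(
  seen: WindowFluents, live: WindowFluents | null,
  x: number, y: number, tolerancePx: number = 2
): boolean {
  if (live === null) return false;
  return live.exists && live.isForeground &&
    rectStable(seen.rect, live.rect, tolerancePx) && pointInside(live.rect, x, y);
}
```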

Intentional Omissions

We consciously excluded "heavy" sensors such as full UIA tree traversals and continuous screenshot diffing.

This stage was crucial for finding the right balance between latency and safety. It taught us that most "accidents" could be prevented by checking a few Win32-level fluents right before the motor command fires.

Concrete failures

Why this exists.

RPG is motivated by concrete failure modes that share the same structure: assumptions valid at observation time are no longer valid at action time.

Common issues

Focus theft: Another window pops up.
Modal insertion: A dialog box blocks the path.
Window drift: The target moved slightly.

Identity risks

Entity replacement: The process restarted.
Delayed action: The agent waited too long to click.
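
Because these failure modes share one structure, they can all be detected by comparing the reading taken at observation time with a reading taken just before acting. The sketch below assumes a flat `TargetReading` record and treats a process as the same entity only if both its ID and its start time match, since process IDs can be reused. All names and thresholds are hypothetical.

```typescript
type FailureMode =
  | "focus-theft" | "modal-insertion" | "window-drift"
  | "entity-replacement" | "delayed-action";

// Hypothetical reading of the target at one point in time.
interface TargetReading {
  isForeground: boolean;
  modalOpen: boolean;
  x: number;
  y: number;
  pid: number;
  processStartMs: number;
  takenAtMs: number;
}

function detectFailures(seen: TargetReading, live: TargetReading, maxDelayMs: number): FailureMode[] {
  const found: FailureMode[] = [];
  if (seen.isForeground && !live.isForeground) found.push("focus-theft");
  if (!seen.modalOpen && live.modalOpen) found.push("modal-insertion");
  if (seen.x !== live.x || seen.y !== live.y) found.push("window-drift");
  // Same pid with a different start time means the process was replaced.
  if (seen.pid !== live.pid || seen.processStartMs !== live.processStartMs) {
    found.push("entity-replacement");
  }
  if (live.takenAtMs - seen.takenAtMs > maxDelayMs) found.push("delayed-action");
  return found;
}
```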

Validation

The next thing that matters is evidence.

To validate this direction, the project still needs to measure:

unsafe action rate
re-observation count
token-heavy observation count
task success rate
recovery steps
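
To make the list concrete, here is one possible way to aggregate these measurements from per-episode logs. The definitions are assumptions, not the project's stated methodology: unsafe action rate is taken as unsafe actions over all actions, task success rate as successful episodes over all episodes, and the remaining metrics as per-episode means.

```typescript
// Hypothetical per-episode log record.
interface EpisodeLog {
  actions: number;
  unsafeActions: number;
  reobservations: number;
  tokenHeavyObservations: number;
  recoverySteps: number;
  succeeded: boolean;
}

interface EvalSummary {
  unsafeActionRate: number;
  taskSuccessRate: number;
  meanReobservations: number;
  meanTokenHeavyObservations: number;
  meanRecoverySteps: number;
}

function summarize(logs: EpisodeLog[]): EvalSummary {
  const episodes = logs.length;
  const total = (pick: (e: EpisodeLog) => number): number =>
    logs.reduce((sum, e) => sum + pick(e), 0);
  // Returns 0 instead of NaN when there is nothing to divide by.
  const ratio = (num: number, den: number): number => (den === 0 ? 0 : num / den);

  return {
    unsafeActionRate: ratio(total((e) => e.unsafeActions), total((e) => e.actions)),
    taskSuccessRate: ratio(total((e) => (e.succeeded ? 1 : 0)), episodes),
    meanReobservations: ratio(total((e) => e.reobservations), episodes),
    meanTokenHeavyObservations: ratio(total((e) => e.tokenHeavyObservations), episodes),
    meanRecoverySteps: ratio(total((e) => e.recoverySteps), episodes),
  };
}
```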

See planned evaluation
Read Beyond Coordinate Roulette