What should count as evidence?

Planned Evaluation

Ideas like RPG, lease validation, and guarded execution feel intuitively right. The point of this page is to turn that intuition into concrete measurements.

Why evaluate at all?

Because “this seems right” is not enough.

We need to know whether these ideas actually help in practice:

  1. Do they reduce unsafe actions?
  2. Do they reduce unnecessary re-observation?
  3. Do they reduce expensive retries and recovery loops?

Evaluation questions

The first three questions.

Q1. Safety

Can RPG + lease + guard reduce unsafe actions compared with the snapshot-and-act baseline?

Q2. Observation efficiency

Can it reduce repeated screenshots and other high-cost observation steps?

Q3. Recovery quality

When the world changes, can it fail safely instead of failing silently?

Scenarios

The first planned perturbations.

focus-theft

The active target changes while the agent is still relying on the old one.

modal-insertion

A new dialog blocks the expected action path.

window-drift

The window or target moves after observation.

entity-replacement

The target is re-rendered or replaced while keeping roughly the same visual identity.

delayed-action

The action happens late enough that earlier assumptions are no longer safe.
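The five perturbations above can be sketched as a small scenario registry. The `Scenario` dataclass and its fields are illustrative assumptions; only the scenario names come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """One planned perturbation: a name and a short description."""
    name: str
    description: str

# The registry mirrors the list above; descriptions are paraphrased.
SCENARIOS = [
    Scenario("focus-theft",
             "active target changes while the agent relies on the old one"),
    Scenario("modal-insertion",
             "a new dialog blocks the expected action path"),
    Scenario("window-drift",
             "the window or target moves after observation"),
    Scenario("entity-replacement",
             "the target is re-rendered under a similar visual identity"),
    Scenario("delayed-action",
             "the action lands after earlier assumptions have expired"),
]
```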

Metrics

The first metric set is intentionally small.

Metric | Meaning
unsafe_action_rate | How often an action lands on the wrong target
reobserve_count | How often the agent has to observe again
token_heavy_observations | How often expensive observation is used
task_success_rate | How often the task eventually completes
recovery_steps | How many steps are needed after a problem is detected
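One way to compute this metric set is to aggregate per-episode logs. The episode schema below (keys such as `actions`, `reobservations`, `recovery_steps`) is an assumption for illustration, not the project's actual log format.

```python
def compute_metrics(episodes):
    """Aggregate the five planned metrics over a list of episode dicts.

    Each episode is assumed to record its actions (with a wrong_target
    flag), observation counts, success flag, and post-detection steps.
    """
    actions = [a for ep in episodes for a in ep["actions"]]
    return {
        "unsafe_action_rate":
            sum(a["wrong_target"] for a in actions) / max(len(actions), 1),
        "reobserve_count":
            sum(ep["reobservations"] for ep in episodes),
        "token_heavy_observations":
            sum(ep["expensive_observations"] for ep in episodes),
        "task_success_rate":
            sum(ep["success"] for ep in episodes) / max(len(episodes), 1),
        "recovery_steps":
            sum(ep["recovery_steps"] for ep in episodes),
    }
```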

Baseline vs proposed

What is being compared?

Baseline

  • observe
  • think
  • act
  • optionally confirm

This is the simple loop that implicitly trusts the old snapshot.
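A minimal sketch of that baseline loop, with `observe`, `think`, `act`, and `confirm` as placeholder callables rather than a real API:

```python
def baseline_loop(task, observe, think, act, confirm=None):
    """Snapshot-and-act baseline: observe once, then trust the snapshot."""
    snapshot = observe()                  # one observation, trusted afterwards
    while not task.done:
        action = think(snapshot, task)    # plan against the old snapshot
        act(action)                       # act without re-checking the world
        if confirm is not None:
            confirm(action)               # optional post-hoc confirmation
```

Note that `observe()` is called only once: every later `act()` implicitly assumes the world still matches that first snapshot, which is exactly the assumption the perturbation scenarios attack.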

Proposed

  • provisional state
  • dirty / stale tracking
  • lease validation
  • guarded execution
  • demand-driven refresh

This is the uncertainty-aware loop.
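The lease-and-guard part of this loop can be sketched with a time-based lease. The `Lease` class, `LEASE_TTL`, and `guarded_act` are invented names for illustration; they assume a lease is simply a snapshot timestamp with a time-to-live.

```python
import time

LEASE_TTL = 0.5  # seconds a snapshot stays trusted (assumed value)

class Lease:
    """A lease over a snapshot: valid only within LEASE_TTL of issue."""
    def __init__(self, now=None):
        self.issued = time.monotonic() if now is None else now

    def valid(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.issued) < LEASE_TTL

def guarded_act(lease, action, act, refresh, now=None):
    """Guarded execution: act only under a valid lease.

    If the lease has expired, trigger a demand-driven refresh (which
    returns a fresh lease) instead of acting on stale state.
    """
    if lease.valid(now):
        act(action)
        return lease
    return refresh()
```

The design point is that the validity check happens immediately before the action, not at observation time, so a stale snapshot fails safely into a refresh rather than silently into a wrong click.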

Reporting format

Machine-readable first, human-readable second.

The plan is to store evaluation results in a machine-readable format first:

  • raw JSON
  • summary CSV
  • short Markdown report

The public page should show only compact tables and short explanations, while the raw artifacts remain available for inspection.
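A sketch of that three-artifact layout, assuming per-metric result rows with hypothetical `metric` / `baseline` / `proposed` fields; the file names and schema are placeholders:

```python
import csv
import json
from pathlib import Path

def write_report(results, out_dir):
    """Write raw JSON, a summary CSV, and a short Markdown table.

    `results` is assumed to be a list of dicts like
    {"metric": ..., "baseline": ..., "proposed": ...}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # 1. Raw JSON: the machine-readable source of truth.
    (out / "raw.json").write_text(json.dumps(results, indent=2))

    # 2. Summary CSV: one row per metric.
    with open(out / "summary.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["metric", "baseline", "proposed"])
        writer.writeheader()
        writer.writerows(results)

    # 3. Short Markdown report: the compact table for the public page.
    lines = ["| metric | baseline | proposed |", "| --- | --- | --- |"]
    lines += [f"| {r['metric']} | {r['baseline']} | {r['proposed']} |"
              for r in results]
    (out / "report.md").write_text("\n".join(lines) + "\n")
```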