What should count as evidence?

Planned Evaluation

Ideas like RPG, lease validation, and guarded execution feel intuitively right. The point of this page is to turn that intuition into concrete measurements.

Why evaluate at all?

Because “this seems right” is not enough.

We need to know whether these ideas actually help in practice:

  1. Do they reduce unsafe actions?
  2. Do they reduce unnecessary re-observation?
  3. Do they reduce expensive retries and recovery loops?

Evaluation questions

The first three questions.

Q1. Safety

Can RPG + lease + guard reduce unsafe actions compared with the snapshot-and-act baseline?

Q2. Observation efficiency

Can it reduce repeated screenshots and other high-cost observation steps?

Q3. Recovery quality

When the world changes, can it fail safely instead of failing silently?

Scenarios

The first planned perturbations.

focus-theft

The active target changes while the agent is still relying on the old one.

modal-insertion

A new dialog blocks the expected action path.

window-drift

The window or target moves after observation.

entity-replacement

The target is re-rendered or replaced while keeping roughly the same visual identity.

delayed-action

The action happens late enough that earlier assumptions are no longer safe.
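The five perturbations above can be sketched as a small scenario registry. The `Scenario` dataclass and its fields are illustrative assumptions; only the scenario names come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """One planned perturbation: a name and a short description."""
    name: str
    description: str

# The registry mirrors the list above; descriptions are paraphrased.
SCENARIOS = [
    Scenario("focus-theft",
             "active target changes while the agent relies on the old one"),
    Scenario("modal-insertion",
             "a new dialog blocks the expected action path"),
    Scenario("window-drift",
             "the window or target moves after observation"),
    Scenario("entity-replacement",
             "the target is re-rendered under a similar visual identity"),
    Scenario("delayed-action",
             "the action lands after earlier assumptions have expired"),
]
```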

Metrics

The first metric set is intentionally small.

Metric | Meaning
unsafe_action_rate | How often an action lands on the wrong target
reobserve_count | How often the agent has to observe again
token_heavy_observations | How often expensive observation is used
task_success_rate | How often the task eventually completes
recovery_steps | How many steps are needed after a problem is detected
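One way to compute this metric set is to aggregate per-episode logs. The episode schema below (keys such as `actions`, `reobservations`, `recovery_steps`) is an assumption for illustration, not the project's actual log format.

```python
def compute_metrics(episodes):
    """Aggregate the five planned metrics over a list of episode dicts.

    Each episode is assumed to record its actions (with a wrong_target
    flag), observation counts, success flag, and post-detection steps.
    """
    actions = [a for ep in episodes for a in ep["actions"]]
    return {
        "unsafe_action_rate":
            sum(a["wrong_target"] for a in actions) / max(len(actions), 1),
        "reobserve_count":
            sum(ep["reobservations"] for ep in episodes),
        "token_heavy_observations":
            sum(ep["expensive_observations"] for ep in episodes),
        "task_success_rate":
            sum(ep["success"] for ep in episodes) / max(len(episodes), 1),
        "recovery_steps":
            sum(ep["recovery_steps"] for ep in episodes),
    }
```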

Baseline vs proposed

What is being compared?

Baseline

  • observe
  • think
  • act
  • optionally confirm

This is the simple loop that implicitly trusts the old snapshot.
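A minimal sketch of that baseline loop, with `observe`, `think`, `act`, and `confirm` as placeholder callables rather than a real API:

```python
def baseline_loop(task, observe, think, act, confirm=None):
    """Snapshot-and-act baseline: observe once, then trust the snapshot."""
    snapshot = observe()                  # one observation, trusted afterwards
    while not task.done:
        action = think(snapshot, task)    # plan against the old snapshot
        act(action)                       # act without re-checking the world
        if confirm is not None:
            confirm(action)               # optional post-hoc confirmation
```

Note that `observe()` is called only once: every later `act()` implicitly assumes the world still matches that first snapshot, which is exactly the assumption the perturbation scenarios attack.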

Proposed

  • provisional state
  • dirty / stale tracking
  • lease validation
  • guarded execution
  • demand-driven refresh

This is the uncertainty-aware loop.
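The lease-and-guard part of this loop can be sketched with a time-based lease. The `Lease` class, `LEASE_TTL`, and `guarded_act` are invented names for illustration; they assume a lease is simply a snapshot timestamp with a time-to-live.

```python
import time

LEASE_TTL = 0.5  # seconds a snapshot stays trusted (assumed value)

class Lease:
    """A lease over a snapshot: valid only within LEASE_TTL of issue."""
    def __init__(self, now=None):
        self.issued = time.monotonic() if now is None else now

    def valid(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.issued) < LEASE_TTL

def guarded_act(lease, action, act, refresh, now=None):
    """Guarded execution: act only under a valid lease.

    If the lease has expired, trigger a demand-driven refresh (which
    returns a fresh lease) instead of acting on stale state.
    """
    if lease.valid(now):
        act(action)
        return lease
    return refresh()
```

The design point is that the validity check happens immediately before the action, not at observation time, so a stale snapshot fails safely into a refresh rather than silently into a wrong click.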

Reporting format

Machine-readable first, human-readable second.

The plan is to store evaluation results in a machine-readable format first:

  • raw JSON
  • summary CSV
  • short Markdown report

The public page should show only compact tables and short explanations, while the raw artifacts remain available for inspection.
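A sketch of that three-artifact layout, assuming per-metric result rows with hypothetical `metric` / `baseline` / `proposed` fields; the file names and schema are placeholders:

```python
import csv
import json
from pathlib import Path

def write_report(results, out_dir):
    """Write raw JSON, a summary CSV, and a short Markdown table.

    `results` is assumed to be a list of dicts like
    {"metric": ..., "baseline": ..., "proposed": ...}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # 1. Raw JSON: the machine-readable source of truth.
    (out / "raw.json").write_text(json.dumps(results, indent=2))

    # 2. Summary CSV: one row per metric.
    with open(out / "summary.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["metric", "baseline", "proposed"])
        writer.writeheader()
        writer.writerows(results)

    # 3. Short Markdown report: the compact table for the public page.
    lines = ["| metric | baseline | proposed |", "| --- | --- | --- |"]
    lines += [f"| {r['metric']} | {r['baseline']} | {r['proposed']} |"
              for r in results]
    (out / "report.md").write_text("\n".join(lines) + "\n")
```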