Planned Evaluation
Ideas like RPG, lease validation, and guarded execution feel intuitively right. The point of this page is to turn that intuition into concrete measurements, because “this seems right” is not enough.
We need to know whether these ideas actually help in practice:
- Do they reduce unsafe actions?
- Do they reduce unnecessary re-observation?
- Do they reduce expensive retries and recovery loops?
The first three questions.
Q1. Safety
Can RPG + lease + guard reduce unsafe actions compared with snapshot-and-act?
Q2. Observation efficiency
Can it reduce repeated screenshots and other high-cost observation steps?
Q3. Recovery quality
When the world changes, can it fail safely instead of failing silently?
The first planned perturbations.
focus-theft
The active target changes while the agent is still relying on the old one.
modal-insertion
A new dialog blocks the expected action path.
window-drift
The window or target moves after observation.
entity-replacement
The target is re-rendered or replaced under the same rough visual identity.
delayed-action
The action happens late enough that earlier assumptions are no longer safe.
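One way to make these perturbations testable is to enumerate them and cross them with the two loops under comparison. The sketch below is only an illustration: the `Trial` record, the `CONDITIONS` labels, and the use of integer seeds are assumptions, not part of the plan described on this page.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Perturbation(Enum):
    """The five perturbations named on this page."""
    FOCUS_THEFT = "focus-theft"
    MODAL_INSERTION = "modal-insertion"
    WINDOW_DRIFT = "window-drift"
    ENTITY_REPLACEMENT = "entity-replacement"
    DELAYED_ACTION = "delayed-action"


# Assumed labels for the two loops being compared.
CONDITIONS = ("baseline", "proposed")


@dataclass(frozen=True)
class Trial:
    """Hypothetical record for one run: which perturbation, which loop, which seed."""
    perturbation: Perturbation
    condition: str
    seed: int


def build_trials(seeds):
    """Return one trial per (perturbation, condition, seed) combination."""
    return [Trial(p, c, s) for p, c, s in product(Perturbation, CONDITIONS, seeds)]
```

Running both conditions on the same perturbation and seed keeps the comparison paired, so a difference in a metric is less likely to come from the scenario itself.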
The first metric set is intentionally small.
| Metric | Meaning |
|---|---|
| unsafe_action_rate | How often an action lands on the wrong target |
| reobserve_count | How often the agent has to observe again |
| token_heavy_observations | How often expensive observation is used |
| task_success_rate | How often the task eventually completes |
| recovery_steps | How many steps are needed after a problem is detected |
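A minimal sketch of how these five metrics could be aggregated from per-episode logs. The `Episode` fields and the choice of denominators are assumptions: here `unsafe_action_rate` is unsafe actions divided by all actions, `task_success_rate` is the fraction of successful episodes, and the other three are per-episode means. The page itself does not fix these definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """Hypothetical per-episode log; field names mirror the metric table."""
    actions: int                   # total actions attempted
    unsafe_actions: int            # actions that landed on the wrong target
    reobserve_count: int           # extra observations taken
    token_heavy_observations: int  # expensive observations (e.g. screenshots)
    succeeded: bool                # did the task eventually complete
    recovery_steps: int            # steps spent after a problem was detected


def summarize(episodes):
    """Aggregate a non-empty list of episodes into one metric row."""
    if not episodes:
        raise ValueError("need at least one episode")
    total_actions = sum(e.actions for e in episodes)
    unsafe = sum(e.unsafe_actions for e in episodes)
    return {
        # Assumed definition: unsafe actions over all actions, 0.0 if none ran.
        "unsafe_action_rate": unsafe / total_actions if total_actions else 0.0,
        "reobserve_count": mean([e.reobserve_count for e in episodes]),
        "token_heavy_observations": mean(
            [e.token_heavy_observations for e in episodes]
        ),
        "task_success_rate": sum(e.succeeded for e in episodes) / len(episodes),
        "recovery_steps": mean([e.recovery_steps for e in episodes]),
    }
```

Calling `summarize` once per condition gives two rows that can be compared directly.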
What is being compared?
Baseline
- observe
- think
- act
- optionally confirm
This is the simple loop that implicitly trusts the old snapshot.
Proposed
- provisional state
- dirty / stale tracking
- lease validation
- guarded execution
- demand-driven refresh
This is the uncertainty-aware loop.
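The difference between the two loops can be sketched with a toy world. Everything here is hypothetical: `World`, `Lease`, `observe`, `lease_valid`, `baseline_act`, and `guarded_act` are illustrative names, and a simple generation counter stands in for whatever staleness signal the real system uses.

```python
from dataclasses import dataclass


@dataclass
class World:
    """Toy environment: one target plus a counter bumped on every change."""
    target_id: str = "button-A"
    generation: int = 0

    def perturb(self, new_target: str) -> None:
        self.target_id = new_target
        self.generation += 1


@dataclass(frozen=True)
class Lease:
    """Snapshot of what the agent believed at observation time."""
    target_id: str
    generation: int


def observe(world: World) -> Lease:
    return Lease(world.target_id, world.generation)


def lease_valid(world: World, lease: Lease) -> bool:
    # Assumed validity rule: the lease holds only if nothing changed since.
    return lease.generation == world.generation


def baseline_act(world: World, lease: Lease):
    """Snapshot-and-act: trusts the old snapshot and never re-checks it."""
    status = "ok" if lease.target_id == world.target_id else "unsafe"
    return status, 0


def guarded_act(world: World, lease: Lease, max_refreshes: int = 1):
    """Validate the lease before acting; refresh only when it is stale."""
    refreshes = 0
    while not lease_valid(world, lease):
        if refreshes >= max_refreshes:
            return "aborted", refreshes  # fail safely instead of acting blind
        lease = observe(world)           # demand-driven refresh
        refreshes += 1
    return "ok", refreshes
```

Both functions return a `(status, reobservations)` pair. In this toy the world cannot change between validation and action; a real guard would also need to handle that window.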
Machine-readable first, human-readable second.
The plan is to store evaluation results in machine-readable form first, with a short report derived from them:
- raw JSON
- summary CSV
- short Markdown report
The public page should show only compact tables and short explanations, while the raw artifacts remain available for inspection.
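A sketch of how the three artifacts could be written from one summary structure, using only the Python standard library. The file names (`results.json`, `summary.csv`, `report.md`) and the `{condition: {metric: value}}` layout are assumptions for illustration, not a fixed format.

```python
import csv
import json
from pathlib import Path


def write_artifacts(summary, out_dir):
    """Write JSON, CSV, and Markdown views of {condition: {metric: value}}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Stable column order across all three files.
    metrics = sorted({m for row in summary.values() for m in row})

    # Raw JSON: the full structure, kept for inspection.
    (out / "results.json").write_text(
        json.dumps(summary, indent=2, sort_keys=True), encoding="utf-8"
    )

    # Summary CSV: one row per condition, one column per metric.
    with (out / "summary.csv").open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["condition", *metrics])
        for condition, row in summary.items():
            writer.writerow([condition, *(row.get(m, "") for m in metrics)])

    # Short Markdown report: the same table in pipe-table form.
    lines = [
        "| Condition | " + " | ".join(metrics) + " |",
        "|---|" + "---|" * len(metrics),
    ]
    for condition, row in summary.items():
        cells = " | ".join(str(row.get(m, "")) for m in metrics)
        lines.append(f"| {condition} | {cells} |")
    (out / "report.md").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

Deriving the CSV and Markdown from the same structure as the JSON keeps the compact public table traceable back to the raw artifact.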