Execution Flow & State Model

This document describes how Agent Vigilo processes an evaluation run from start to completion, including execution lifecycle, retry behavior, and event publication.

Overview

At a high level, the system follows a fan-out / fan-in pattern:

Run is created
Executions are generated from dataset cases
Workers process executions via attempts
Evaluators produce append-only results
Execution aggregates are computed
Run is finalized
Completion event is published

Core Concepts

Run

A run represents a batch evaluation of a versioned agent target against a dataset and evaluation profile.

Execution

An execution represents one dataset case evaluated against the target system.

Attempt

An attempt represents a single worker’s effort to process an execution. Multiple attempts may exist due to retries or failures.

Evaluator Result

An evaluator result is an append-only record of the outcome of a single evaluator.

End-to-End Flow

1. Run Creation

A run is created with:

dataset
evaluation profile
aggregation policy
agent configuration (versioned target)

The run starts in the following state:

run.status = pending
run.gate_status = unknown

Executions are then generated:

execution.status = pending
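
The fan-out step above can be sketched as follows. This is a minimal in-memory illustration, not the project's actual schema: the `Run` and `Execution` classes, their field names, and the `create_run` helper are assumptions made for the example. Only the initial status values come from the text.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Execution:
    run_id: str
    case_id: str
    status: str = "pending"  # initial state per the flow above
    id: str = field(default_factory=lambda: uuid4().hex)


@dataclass
class Run:
    dataset: str
    evaluation_profile: str
    aggregation_policy: str
    agent_config: str  # versioned target (hypothetical string form)
    status: str = "pending"
    gate_status: str = "unknown"
    id: str = field(default_factory=lambda: uuid4().hex)


def create_run(run: Run, case_ids: list[str]) -> list[Execution]:
    """Fan out: generate one pending execution per dataset case."""
    return [Execution(run_id=run.id, case_id=case_id) for case_id in case_ids]
```

Each execution carries its `run_id`, which is what later allows the fan-in step to find all executions belonging to a run.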

2. Execution Dispatch

Workers claim executions:

execution.status = running
execution_attempt.status = running

Each attempt is associated with:

a worker
a lease (for failure detection)
an attempt number
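
A claim could look roughly like the sketch below. It is an in-memory illustration under stated assumptions: the `Attempt` class, the `claim` function, and the `lease_seconds` parameter are hypothetical names, and a real implementation would need to perform the claim atomically in the database so that two workers cannot claim the same execution.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    execution_id: str
    worker_id: str
    attempt_number: int
    lease_expires_at: float  # used later to detect a failed worker
    status: str = "running"


def claim(executions: dict, attempts: list, worker_id: str,
          lease_seconds: float = 30.0,
          now: Optional[float] = None) -> Optional[Attempt]:
    """Claim the first claimable execution, or return None if there is none."""
    now = time.time() if now is None else now
    for execution_id, status in executions.items():
        if status in ("pending", "retry_scheduled"):
            executions[execution_id] = "running"
            # Attempt numbers count up per execution, starting at 1.
            number = 1 + sum(1 for a in attempts if a.execution_id == execution_id)
            attempt = Attempt(execution_id, worker_id, number, now + lease_seconds)
            attempts.append(attempt)
            return attempt
    return None
```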

3. Agent Invocation

The worker invokes the configured agent target:

may be a single model call
may be a multi-step workflow
is invoked through the agent.http endpoint of the run profile

The worker sends the run and attempt IDs, the agent identity, the case input, and non-oracle case metadata. The response is mapped into the evaluator's actual envelope before any evaluator runs.
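
As an illustration of the request described above, the sketch below assembles a payload and withholds oracle metadata from the agent. The payload keys, the `build_agent_request` function, and the default `oracle_keys` set are all assumptions for the example; the actual wire format is not specified in this document.

```python
def build_agent_request(run_id, attempt_id, agent_identity, case_input,
                        case_metadata, oracle_keys=frozenset({"expected"})):
    """Build the request body, dropping metadata keys reserved for evaluators."""
    visible_metadata = {
        key: value for key, value in case_metadata.items()
        if key not in oracle_keys
    }
    return {
        "run_id": run_id,
        "attempt_id": attempt_id,
        "agent": agent_identity,
        "input": case_input,
        "metadata": visible_metadata,
    }
```

Filtering before the call keeps expected answers out of the agent's view, so evaluators compare against information the target never saw.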

If the agent call fails:

attempt.status = failed_agent_call
execution.status = retry_scheduled (if retryable)

4. Evaluation Phase

After a successful agent response:

evaluators are executed
results are appended to evaluator_results

Each evaluator produces:

status (passed / failed / error / skipped)
severity
normalized score
evidence
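
One way to model such a record is an immutable value plus a store that only supports appending. This is a sketch: the class names, the `attempt_id` and `evaluator` fields, and the string types for severity and evidence are assumptions, while the four status values are taken from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class EvalStatus(str, Enum):
    PASSED = "passed"
    FAILED = "failed"
    ERROR = "error"
    SKIPPED = "skipped"


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class EvaluatorResult:
    attempt_id: str
    evaluator: str
    status: EvalStatus
    severity: str
    normalized_score: float
    evidence: str


class EvaluatorResults:
    """Append-only store: exposes insert and read, no update or delete."""

    def __init__(self):
        self._rows = []

    def append(self, result: EvaluatorResult) -> None:
        self._rows.append(result)

    def for_attempt(self, attempt_id: str) -> tuple:
        return tuple(r for r in self._rows if r.attempt_id == attempt_id)
```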

If evaluation fails:

attempt.status = failed_evaluation
execution.status = retry_scheduled (if retryable)

5. Execution Completion

Once all evaluator results are persisted:

execution aggregate is computed
execution is marked terminal

attempt.status = completed
execution.status = completed | failed | timed_out
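
The aggregation rule itself depends on the run's aggregation policy, which this document does not define. Purely as an illustration, the sketch below assumes a simple policy: count results by status, average the scores of evaluators that actually ran to a verdict, and treat the execution as passing only if nothing failed or errored. The `aggregate` function and its output keys are hypothetical.

```python
from collections import Counter


def aggregate(results):
    """Aggregate (status, normalized_score) pairs under an assumed policy."""
    results = list(results)
    counts = Counter(status for status, _ in results)
    # Skipped and errored evaluators carry no meaningful score.
    scored = [score for status, score in results if status in ("passed", "failed")]
    mean_score = sum(scored) / len(scored) if scored else None
    all_passed = bool(scored) and counts["failed"] == 0 and counts["error"] == 0
    return {"counts": dict(counts), "mean_score": mean_score, "all_passed": all_passed}
```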

6. Retry Flow

If an execution fails but is retryable:

execution.status = retry_scheduled

A new attempt is created:

attempt.status = pending → running

Older attempts may become:

attempt.status = stale

This occurs when:

a worker loses its lease
a newer attempt supersedes it
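
Lease expiry could be handled by a periodic sweep such as the sketch below. The dictionary layout, the `reap_expired` function, and the `max_attempts` limit are assumptions for illustration. In particular, marking the execution as failed once the limit is reached is one possible policy, not something this document specifies.

```python
def reap_expired(attempts, executions, now, max_attempts=3):
    """Mark attempts with expired leases as stale and reschedule their executions."""
    reaped = []
    for attempt in attempts:
        if attempt["status"] == "running" and attempt["lease_expires_at"] <= now:
            attempt["status"] = "stale"
            execution_id = attempt["execution_id"]
            used = sum(1 for a in attempts if a["execution_id"] == execution_id)
            # Retry while attempts remain; otherwise give up (assumed policy).
            executions[execution_id] = (
                "retry_scheduled" if used < max_attempts else "failed"
            )
            reaped.append(execution_id)
    return reaped
```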

7. Run Finalization

Workers (or a coordinator) check:

Are there any non-terminal executions remaining?

If none remain:

run.status = finalizing

The system:

aggregates execution results
computes run summary
determines gate_status

run.status = completed
run.gate_status = pass | fail

8. Event Publication (Outbox Pattern)

A RunCompleted event is inserted into the outbox:

outbox.status = pending

A publisher process then moves each outbox row through these states:

pending → published
        ↘ failed (retry)

Because failed publishes are retried, delivery is reliable and at-least-once, so consumers should tolerate duplicate events.
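
A publisher loop over the outbox might look like the following sketch. The row layout, the `publish_pending` function, the `send` callback, and the `max_attempts` cap are assumptions made for the example; the status values are those shown above.

```python
def publish_pending(outbox, send, max_attempts=5):
    """Try to publish every pending or previously failed outbox row once."""
    for row in outbox:
        if row["status"] not in ("pending", "failed"):
            continue  # already published
        if row["attempts"] >= max_attempts:
            continue  # give up; left for manual inspection
        row["attempts"] += 1
        try:
            send(row["payload"])
        except Exception:
            row["status"] = "failed"  # picked up again on the next pass
        else:
            row["status"] = "published"
```

If the process crashes after `send` succeeds but before the status is saved, the row is sent again on the next pass. That is the source of the at-least-once guarantee.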

State Machines

Run Lifecycle

pending → running → finalizing → completed
                  ↘ failed
                  ↘ cancelled

Execution Lifecycle

pending → running → completed
                  ↘ retry_scheduled → running
                  ↘ failed
                  ↘ timed_out
                  ↘ cancelled
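
The execution lifecycle above can be encoded as a transition table so that illegal state changes are rejected early. This sketch reads the diagram as having every branch leave the running state; the `EXECUTION_TRANSITIONS` table and the `transition` and `is_terminal` helpers are hypothetical names.

```python
EXECUTION_TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "retry_scheduled", "failed", "timed_out", "cancelled"},
    "retry_scheduled": {"running"},
    # Terminal states have no outgoing transitions.
    "completed": set(),
    "failed": set(),
    "timed_out": set(),
    "cancelled": set(),
}


def is_terminal(status: str) -> bool:
    return not EXECUTION_TRANSITIONS[status]


def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the lifecycle does not allow the move."""
    if new not in EXECUTION_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal execution transition: {current} -> {new}")
    return new
```

The same table-driven approach would apply to the run and attempt lifecycles with their own state sets.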

Attempt Lifecycle

pending → running → completed
                  ↘ failed_agent_call
                  ↘ failed_evaluation
                  ↘ timed_out
                  ↘ cancelled
                  ↘ stale

Key Design Properties

1. Append-only evaluator results

Evaluator outputs are never updated, only inserted. This provides:

auditability
reproducibility
traceability

2. Separation of state vs evidence

state tables (runs, executions, attempts) are mutable
evaluator results are immutable facts

3. Idempotent finalization

Multiple workers may attempt to finalize a run.

The system ensures:

only one finalization succeeds
duplicate attempts are safe
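
One common way to get this property is a conditional update that only succeeds for the first caller. The sketch below uses SQLite from the Python standard library. The table and column names and the `try_finalize` function are assumptions, not the project's actual schema.

```python
import sqlite3

TERMINAL = ("completed", "failed", "timed_out", "cancelled")


def try_finalize(conn: sqlite3.Connection, run_id: str) -> bool:
    """Return True only for the caller that wins the finalization race."""
    placeholders = ",".join("?" * len(TERMINAL))
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            f"""
            UPDATE runs SET status = 'finalizing'
            WHERE id = ? AND status = 'running'
              AND NOT EXISTS (
                SELECT 1 FROM executions
                WHERE run_id = ? AND status NOT IN ({placeholders})
              )
            """,
            (run_id, run_id, *TERMINAL),
        )
        # Zero rows changed means the run is not ready or was already taken.
        return cursor.rowcount == 1
```

A losing caller simply sees `False` and moves on, which is what makes duplicate finalization attempts harmless.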

4. Retry-safe execution

Executions may have multiple attempts.

Only the most recent non-stale attempt is authoritative.
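
The selection rule above can be written directly. The dictionary layout and the `authoritative_attempt` function name are assumptions for illustration.

```python
def authoritative_attempt(attempts):
    """Return the most recent non-stale attempt, or None if there is none."""
    live = [a for a in attempts if a["status"] != "stale"]
    return max(live, key=lambda a: a["attempt_number"], default=None)
```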

5. Reliable event delivery

The outbox pattern ensures:

no lost events
retryable publishing
eventual consistency

Design Philosophy

Agent Vigilo evaluates the behavior of a target system, not just a model.

An "agent" may represent:

a single model call
a prompt pipeline
a multi-step workflow
a deployed HTTP service

The evaluation system treats all targets uniformly via a versioned invocation interface.

Summary

The system is designed to:

handle distributed execution safely
tolerate worker failure and retries
preserve evaluation evidence
produce deterministic, policy-driven outcomes
reliably signal completion to downstream systems
