Skip to main content

Scaling

Agent Vigilo starts as a single PostgreSQL deployment, but high-volume execution data is keyed by a logical run_shard so large runs can spread across 128 shard partitions and later move to additional database placements.

Current Implementation Status

Implemented today:

  • 128 logical run_shard values.
  • Shard-partitioned high-volume execution tables inside one PostgreSQL database.
  • Chunk-ready messages that carry run_id, run_shard, and chunk_id.
  • Worker hot-path queries that use run_id + run_shard for chunk-local processing.

Not implemented yet:

  • Multiple PostgreSQL placement routing.
  • A runtime db_for_shard(run_id, run_shard) resolver.
  • Coordinator ownership by shard range or database placement.
  • Cross-database run finalization, cancellation, result export, or status fanout.
  • Automated shard migration tooling.

The current system should be read as logical sharding on a single PostgreSQL deployment, with schema and message contracts prepared for future physical shard placement.

Current Shard Model

The system uses 128 logical shards:

run_shard = 0..127

Run chunks receive a shard during run creation. Every execution, attempt, aggregate, and evaluator result produced from that chunk carries the same shard key:

run_chunks(run_id, run_shard, id)
executions(run_id, run_shard, id)
execution_attempts(run_id, run_shard, id)
execution_aggregates(run_id, run_shard, execution_id)
evaluator_results(run_id, run_shard, id)

The high-volume tables are list partitioned by run_shard. A worker that owns one chunk can query by run_id + run_shard + chunk_id, keeping the chunk-local workflow on one partition per table.

Small Deployment

The default deployment can keep every logical shard on one PostgreSQL cluster:

run_shard 0..127 -> primary PostgreSQL

No multi-cluster routing is required to start. The shard key still matters because it prevents one massive run from concentrating all execution writes into one physical partition.

Queue and Worker Routing

Chunk-ready messages include the routing key:

{
"run_id": "00000000-0000-0000-0000-000000000000",
"run_shard": 42,
"chunk_id": "00000000-0000-0000-0000-000000000000"
}

Workers use run_id + run_shard + chunk_id to claim the chunk. In the single-cluster deployment, the database resolver is trivial because every shard points at the same pool. In a future multi-cluster deployment, the resolver should map run_id + run_shard to a database placement before the worker opens a transaction.

Future Placement Layer

When one PostgreSQL cluster is not enough, add placement metadata without changing chunk messages or row identity:

CREATE TABLE shard_placements (
run_id UUID NOT NULL,
run_shard SMALLINT NOT NULL CHECK (run_shard >= 0 AND run_shard < 128),
database_alias TEXT NOT NULL,
status TEXT NOT NULL CHECK (status IN ('active', 'moving', 'draining')),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (run_id, run_shard)
);

Initial placement can map every shard to the primary cluster. Expansion changes placement rows, not the logical shard value stored in execution data.

Migration Runbook

Use this procedure when moving logical shards to a new PostgreSQL placement.

  1. Choose shard scope. Select one or more run_id + run_shard pairs, or a whole run_shard range if the placement layer supports global shard ownership.
  2. Mark placement moving. Set selected placement rows to moving so coordinators stop dispatching new chunks for those shards. Existing workers may finish their current leases.
  3. Drain open work. Wait until selected chunks are no longer leased, or cancel/recover leases according to the chunk recovery policy.
  4. Copy shard-owned rows. Move rows for the selected shard scope from the source placement to the target placement:
run_chunks
executions
execution_attempts
execution_aggregates
evaluator_results
  1. Verify data. Compare row counts and checksums per table for the selected run_id + run_shard scope.
  2. Switch placement. Update placement rows to active on the new database_alias.
  3. Resume dispatch. Coordinators and workers resolve the same run_id + run_shard to the new database placement and continue processing.
  4. Retain then clean up. Keep source rows read-only for a retention window, then delete them after verification and backup policy allow it.

Migration Standards

  • run_shard is stable for the lifetime of a chunk and its child rows.
  • Do not recalculate run_shard when adding database capacity.
  • Queue messages must include run_shard; workers should not rediscover it with an extra lookup on the hot path.
  • Chunk-local worker queries should include run_id and run_shard.
  • Run-level reporting may fan out across all 128 shards for the run.
  • Large payload artifacts should be moved outside PostgreSQL before cross-cluster movement becomes routine.

Recreating a Database

For a fresh deployment, apply the original migration files normally. The 128-shard schema is part of the create-table migrations, so no separate ALTER TABLE migration is required when recreating the database.