Scaling
Agent Vigilo starts as a single PostgreSQL deployment, but high-volume execution data is keyed by a logical run_shard so large runs can spread across 128 shard partitions and later move to additional database placements.
Current Implementation Status
Implemented today:
- 128 logical
run_shardvalues. - Shard-partitioned high-volume execution tables inside one PostgreSQL database.
- Chunk-ready messages that carry
run_id,run_shard, andchunk_id. - Worker hot-path queries that use
run_id + run_shardfor chunk-local processing.
Not implemented yet:
- Multiple PostgreSQL placement routing.
- A runtime
db_for_shard(run_id, run_shard)resolver. - Coordinator ownership by shard range or database placement.
- Cross-database run finalization, cancellation, result export, or status fanout.
- Automated shard migration tooling.
The current system should be read as logical sharding on a single PostgreSQL deployment, with schema and message contracts prepared for future physical shard placement.
Current Shard Model
The system uses 128 logical shards:
run_shard = 0..127
Run chunks receive a shard during run creation. Every execution, attempt, aggregate, and evaluator result produced from that chunk carries the same shard key:
run_chunks(run_id, run_shard, id)
executions(run_id, run_shard, id)
execution_attempts(run_id, run_shard, id)
execution_aggregates(run_id, run_shard, execution_id)
evaluator_results(run_id, run_shard, id)
The high-volume tables are list partitioned by run_shard. A worker that owns one chunk can query by run_id + run_shard + chunk_id, keeping the chunk-local workflow on one partition per table.
Small Deployment
The default deployment can keep every logical shard on one PostgreSQL cluster:
run_shard 0..127 -> primary PostgreSQL
No multi-cluster routing is required to start. The shard key still matters because it prevents one massive run from concentrating all execution writes into one physical partition.
Queue and Worker Routing
Chunk-ready messages include the routing key:
{
"run_id": "00000000-0000-0000-0000-000000000000",
"run_shard": 42,
"chunk_id": "00000000-0000-0000-0000-000000000000"
}
Workers use run_id + run_shard + chunk_id to claim the chunk. In the single-cluster deployment, the database resolver is trivial because every shard points at the same pool. In a future multi-cluster deployment, the resolver should map run_id + run_shard to a database placement before the worker opens a transaction.
Future Placement Layer
When one PostgreSQL cluster is not enough, add placement metadata without changing chunk messages or row identity:
CREATE TABLE shard_placements (
run_id UUID NOT NULL,
run_shard SMALLINT NOT NULL CHECK (run_shard >= 0 AND run_shard < 128),
database_alias TEXT NOT NULL,
status TEXT NOT NULL CHECK (status IN ('active', 'moving', 'draining')),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (run_id, run_shard)
);
Initial placement can map every shard to the primary cluster. Expansion changes placement rows, not the logical shard value stored in execution data.
Migration Runbook
Use this procedure when moving logical shards to a new PostgreSQL placement.
- Choose shard scope. Select one or more
run_id + run_shardpairs, or a wholerun_shardrange if the placement layer supports global shard ownership. - Mark placement moving. Set selected placement rows to
movingso coordinators stop dispatching new chunks for those shards. Existing workers may finish their current leases. - Drain open work. Wait until selected chunks are no longer
leased, or cancel/recover leases according to the chunk recovery policy. - Copy shard-owned rows. Move rows for the selected shard scope from the source placement to the target placement:
run_chunks
executions
execution_attempts
execution_aggregates
evaluator_results
- Verify data. Compare row counts and checksums per table for the selected
run_id + run_shardscope. - Switch placement. Update placement rows to
activeon the newdatabase_alias. - Resume dispatch. Coordinators and workers resolve the same
run_id + run_shardto the new database placement and continue processing. - Retain then clean up. Keep source rows read-only for a retention window, then delete them after verification and backup policy allow it.
Migration Standards
run_shardis stable for the lifetime of a chunk and its child rows.- Do not recalculate
run_shardwhen adding database capacity. - Queue messages must include
run_shard; workers should not rediscover it with an extra lookup on the hot path. - Chunk-local worker queries should include
run_idandrun_shard. - Run-level reporting may fan out across all 128 shards for the run.
- Large payload artifacts should be moved outside PostgreSQL before cross-cluster movement becomes routine.
Recreating a Database
For a fresh deployment, apply the original migration files normally. The 128-shard schema is part of the create-table migrations, so no separate ALTER TABLE migration is required when recreating the database.