AI Evaluation Infrastructure

What you get

Evaluation infrastructure, not another score script.

Version WASM evaluators once, reference them from profiles, and keep scoring logic stable across local runs, CI, and production gates.
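As a rough illustration of the idea, the sketch below publishes an evaluator under a fixed version and lets several profiles point at the same reference. Every name here (`EvaluatorRegistry`, `EvaluatorRef`, `Profile`, `exact-match`) is hypothetical and not taken from the product's API. The bytes stand in for a compiled WASM module, and nothing is actually executed.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorRef:
    """Hypothetical pinned reference: name, version, and content digest."""
    name: str
    version: str
    sha256: str


class EvaluatorRegistry:
    """Hypothetical in-memory registry; a real one would be persistent."""

    def __init__(self) -> None:
        self._modules: dict[tuple[str, str], bytes] = {}

    def publish(self, name: str, version: str, wasm_bytes: bytes) -> EvaluatorRef:
        key = (name, version)
        if key in self._modules:
            # A published version is immutable, so scoring logic cannot drift.
            raise ValueError(f"{name}@{version} is already published")
        self._modules[key] = wasm_bytes
        return EvaluatorRef(name, version, hashlib.sha256(wasm_bytes).hexdigest())

    def resolve(self, ref: EvaluatorRef) -> bytes:
        wasm_bytes = self._modules[(ref.name, ref.version)]
        if hashlib.sha256(wasm_bytes).hexdigest() != ref.sha256:
            raise ValueError(f"digest mismatch for {ref.name}@{ref.version}")
        return wasm_bytes


@dataclass(frozen=True)
class Profile:
    """Hypothetical profile: a named set of pinned evaluator references."""
    name: str
    evaluators: tuple[EvaluatorRef, ...]


registry = EvaluatorRegistry()
# Placeholder bytes (the WASM magic number and version header only).
ref = registry.publish("exact-match", "1.0.0", b"\x00asm\x01\x00\x00\x00")

local_profile = Profile("local", (ref,))
ci_profile = Profile("ci", (ref,))
```

Because both profiles hold the same pinned reference, a local run and a CI run resolve to byte-identical evaluator code under these assumptions.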

Coordinators dispatch durable run chunks; workers call the target agent, execute evaluators, and persist normalized results.
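The division of labor can be sketched roughly as follows. This is a minimal single-process model under stated assumptions, not the product's implementation: `Chunk`, `plan_chunks`, and `work` are hypothetical names, the agent and evaluator are plain Python callables, and a dictionary stands in for durable storage.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    """Hypothetical unit of work: a slice of one run's test cases."""
    run_id: str
    index: int
    cases: tuple[tuple[str, str, str], ...]  # (case_id, prompt, expected)


def plan_chunks(run_id: str, cases: list[tuple[str, str, str]], chunk_size: int) -> list[Chunk]:
    """Coordinator side: split a run into fixed-size chunks."""
    return [
        Chunk(run_id, start // chunk_size, tuple(cases[start:start + chunk_size]))
        for start in range(0, len(cases), chunk_size)
    ]


def work(
    chunk: Chunk,
    agent: Callable[[str], str],
    evaluator: Callable[[str, str], float],
    store: dict[tuple[str, int], list[dict]],
) -> None:
    """Worker side: call the agent, score each output, persist normalized rows."""
    key = (chunk.run_id, chunk.index)
    if key in store:
        # Already persisted, so a re-dispatched chunk is a no-op.
        return
    rows = []
    for case_id, prompt, expected in chunk.cases:
        output = agent(prompt)
        score = evaluator(output, expected)
        rows.append({"case_id": case_id, "score": score, "passed": score >= 1.0})
    store[key] = rows


# Stand-ins for the target agent and an exact-match evaluator.
def agent(prompt: str) -> str:
    return prompt.upper()


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


cases = [("c1", "a", "A"), ("c2", "b", "B"), ("c3", "c", "x")]
chunks = plan_chunks("run-1", cases, chunk_size=2)
store: dict[tuple[str, int], list[dict]] = {}
for chunk in chunks:
    work(chunk, agent, exact_match, store)
```

Keying persisted results by run and chunk index is one way to make retries safe: dispatching the same chunk twice leaves the stored rows unchanged.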

Watch pass/fail outcomes, inspect summaries, and export execution evidence for release decisions and debugging.
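A release gate over persisted results might look something like the sketch below. The function names, the `min_pass_rate` threshold, and the JSON layout are all assumptions made for illustration; the actual summary and export formats are not specified here.

```python
import json


def summarize(results: list[dict], min_pass_rate: float) -> dict:
    """Hypothetical summary: counts, pass rate, and a gate decision."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / total if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": pass_rate,
        "gate": "pass" if pass_rate >= min_pass_rate else "fail",
    }


def export_evidence(summary: dict, results: list[dict]) -> str:
    """Bundle the summary with per-case rows so a decision can be audited."""
    return json.dumps({"summary": summary, "results": results}, sort_keys=True, indent=2)


results = [
    {"case_id": "c1", "passed": True},
    {"case_id": "c2", "passed": True},
    {"case_id": "c3", "passed": True},
    {"case_id": "c4", "passed": False},
]
summary = summarize(results, min_pass_rate=0.9)
evidence = export_evidence(summary, results)
```

With three of four cases passing against a 0.9 threshold, the gate in this sketch fails, and the exported JSON retains the failing case for debugging.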