An evaluation methodology for production agent estates
Versioned harnesses, regression discipline and severity-gated deployment — the methodology behind AgentAudit's evaluation module, distilled from production engineering at Turing.
Contents
- Why production evaluation differs from research benchmarks
- The unit of evaluation: harnesses, not benchmarks
- Anatomy of a harness — cases, oracles, severities
- Authoring cases that survive contact with production
- Oracles: deterministic, model-graded, human
- Severity grading and the deployment gate
- The 24-hour rule, in practice
- Held-out cohorts and independence
- Drift, regression and the trend dashboard
- Integrating with CI and the deployment system
- Reporting evidence to second-line and regulators
- Operating the methodology at estate scale
Why production evaluation differs from research benchmarks
Research benchmarks measure capability ceilings on static datasets and answer a single question: how well can the best version of this model do on this task, under ideal conditions, today? That is a useful question for model selection. It is the wrong question for a regulated firm operating a fleet of agents inside live customer journeys.
Production evaluation answers a different question: is this specific agent, on this specific prompt graph, against this specific tool surface, still safe and effective to keep running for the customers in front of it this week? The dataset is not static — it is the customer cohort that arrived overnight. The system under test is not the model — it is the agent, its scaffolding, its instructions, its tools, its retrieval index, and the upstream and downstream services it depends on, all of which drift independently.
This shift has three operational consequences. First, evaluation must be cheap enough to run on every change, not just at release. Second, it must be severity-aware: not every assertion failure should block a deploy, or engineering velocity dies. Third, the evidence it produces must be defensible to a second-line risk function and, ultimately, to a regulator — which means it has to be reproducible, versioned and trustworthy long after the engineer who wrote it has rotated off the team.
The unit of evaluation: harnesses, not benchmarks
In AgentAudit, the unit of evaluation is the harness: a versioned bundle of cases, oracles and severity rules attached to a specific agent. Harnesses are first-class artefacts. They have owners, version histories, change-review and an audit trail. They are deployed alongside the agent they evaluate, and they evolve in lock-step with it.
Treating the harness as the unit, rather than the benchmark score, is the single most important architectural decision in production evaluation. It moves evaluation from 'a number we look at on Friday' to 'the contract that decides whether code ships'. It also makes governance tractable: the second-line team reviews and approves harnesses, not raw model outputs.
Anatomy of a harness — cases, oracles, severities
A harness is composed of three primitives. Cases describe inputs to the agent and the assertions about its behaviour that must hold. Oracles are the mechanisms that evaluate those assertions. Severity rules decide what happens when an assertion fails: blocker, regression, or warning.
Every primitive is versioned independently. A change to an oracle's threshold is a reviewable event. A change to a case is reviewable. A change to severity is reviewable. The audit trail therefore answers not just 'did the agent pass?' but 'was the bar at the right height when it passed?'.
- Cases: input scenarios + behavioural assertions (not just expected output strings)
- Oracles: the evaluation mechanism — deterministic first, model-graded second, human third
- Severities: blocker, regression, warning — each mapped to a deployment-gate decision
- Metadata: owner, framework tags, last-changed-by, last-reviewed-at
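As an illustration of how the three primitives and their metadata might fit together, here is a minimal Python sketch. It is not AgentAudit's actual schema: every class and field name below (Oracle, Assertion, Case, Harness, threshold, framework_tags and so on) is a hypothetical chosen for this example. The only properties taken from the text are the three severities, the three oracle kinds, and independent versioning of each primitive.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKER = "blocker"
    REGRESSION = "regression"
    WARNING = "warning"


class OracleKind(Enum):
    DETERMINISTIC = "deterministic"
    MODEL_GRADED = "model_graded"
    HUMAN = "human"


@dataclass(frozen=True)
class Oracle:
    name: str
    kind: OracleKind
    version: int = 1          # a threshold change bumps the oracle version
    threshold: float = 1.0    # hypothetical field, for illustration only


@dataclass(frozen=True)
class Assertion:
    name: str
    oracle: Oracle
    severity: Severity        # severity changes are reviewable events too


@dataclass(frozen=True)
class Case:
    case_id: str
    agent_input: str
    assertions: tuple[Assertion, ...]
    version: int = 1


@dataclass(frozen=True)
class Harness:
    agent_id: str
    version: str
    owner: str
    framework_tags: tuple[str, ...]
    cases: tuple[Case, ...] = ()

    def assertions_by_severity(self, severity: Severity) -> list[Assertion]:
        """Collect every assertion in the harness at the given severity."""
        return [a for c in self.cases for a in c.assertions if a.severity is severity]


# Usage: one case carrying a blocker assertion and a warning assertion.
unsuitable = Oracle("no_unsuitable_product", OracleKind.DETERMINISTIC)
fee_early = Oracle("fee_schedule_in_first_80_words", OracleKind.DETERMINISTIC)
case = Case(
    case_id="suitability-001",
    agent_input="I am retired and want a low-risk product.",
    assertions=(
        Assertion("never_recommends_unsuitable_product", unsuitable, Severity.BLOCKER),
        Assertion("mentions_fee_schedule_early", fee_early, Severity.WARNING),
    ),
)
harness = Harness("advice-agent", "1.4.0", "advice-team", ("FCA Consumer Duty",), (case,))
```

Freezing the dataclasses is a deliberate choice in this sketch: any change produces a new object, which is a natural fit for a model in which every change is a reviewable, versioned event.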
Oracles: deterministic, model-graded, human
Oracles are ordered in a hierarchy. Deterministic checks come first because they are cheap, repeatable and not subject to grader drift. They cover everything that can be checked by a regular expression, a structured-output schema, a tool-call inspection or a numeric tolerance. Most safety-critical assertions can be expressed deterministically with enough care.
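The four kinds of deterministic check named above can each be sketched in a few lines of Python. The function names, the dictionary shape used for tool calls, and the default tolerance are assumptions made for this example, not part of any AgentAudit API.

```python
import math
import re


def regex_oracle(output: str, pattern: str) -> bool:
    """Pass when the pattern appears anywhere in the agent's text output."""
    return re.search(pattern, output, flags=re.IGNORECASE) is not None


def schema_oracle(payload: dict, required: dict[str, type]) -> bool:
    """Pass when every required key is present with the expected type.

    A missing key yields None, which fails the isinstance check.
    """
    return all(isinstance(payload.get(key), typ) for key, typ in required.items())


def tool_call_oracle(tool_calls: list[dict], forbidden: set[str]) -> bool:
    """Pass when the agent invoked none of the forbidden tools.

    Assumes each tool call is a dict with a 'name' key (hypothetical shape).
    """
    return all(call["name"] not in forbidden for call in tool_calls)


def tolerance_oracle(actual: float, expected: float, rel_tol: float = 1e-3) -> bool:
    """Pass when a numeric answer is within a relative tolerance of the expected value."""
    return math.isclose(actual, expected, rel_tol=rel_tol)
```

For example, `tool_call_oracle([{"name": "execute_trade"}], {"execute_trade"})` fails, which is the sort of check that can carry a blocker severity without any grader in the loop.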
Model-graded checks come second, for assertions that require natural-language judgement — tone, disclosure adequacy, completeness. Model-graded oracles must themselves be evaluated and versioned. We require a calibration set of human-graded examples for every model-graded oracle, and we monitor inter-grader agreement when we change the grading model.
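One way to quantify agreement between a model-graded oracle and its human-graded calibration set is raw agreement alongside Cohen's kappa, which corrects for agreement expected by chance. The text does not say which statistic is used in practice, so treat the choice of kappa, the function names and the pass/fail boolean encoding as assumptions of this sketch.

```python
def agreement_rate(model: list[bool], human: list[bool]) -> float:
    """Fraction of calibration examples on which the two graders agree."""
    if len(model) != len(human) or not model:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(m == h for m, h in zip(model, human)) / len(model)


def cohens_kappa(model: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement for two graders issuing pass/fail labels."""
    observed = agreement_rate(model, human)
    n = len(model)
    p_model = sum(model) / n   # share of 'pass' labels from the model grader
    p_human = sum(human) / n   # share of 'pass' labels from the human grader
    expected = p_model * p_human + (1 - p_model) * (1 - p_human)
    if expected == 1.0:
        # Both graders gave the same constant label; kappa is formally
        # undefined here, and this sketch reports 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Usage: compare grader output with human labels on a four-example calibration set.
model_labels = [True, True, False, False]
human_labels = [True, False, False, False]
raw = agreement_rate(model_labels, human_labels)
kappa = cohens_kappa(model_labels, human_labels)
```

Re-running the same calculation before and after a change of grading model gives a like-for-like comparison on the same calibration set.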
Human review comes third, for the highest-severity decisions and for a continuous calibration sample. Human review is expensive; we use it sparingly and we always record it.
Severity grading and the deployment gate
Severity is the lever that lets evaluation be both strict and survivable. A blocker failure stops promotion to production; the only override is a documented variance. A regression failure raises an incident, notifies the agent owner, and is tracked to closure, but it does not block deployment. A warning failure is trended on the drift dashboard and triggers no immediate action.
Severity is set per assertion, not per case. A single case can have a blocker assertion (we never recommend an unsuitable product) and a warning assertion (the recommendation should mention the fee schedule in the first 80 words). Reducing every case to pass/fail throws away the only signal that lets engineering teams operate without burnout.
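The mapping from per-assertion severity to a gate decision can be sketched as follows. The type and function names are hypothetical, and the documented-variance override is deliberately not modelled here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    BLOCKER = "blocker"
    REGRESSION = "regression"
    WARNING = "warning"


@dataclass(frozen=True)
class AssertionResult:
    case_id: str
    assertion: str
    severity: Severity
    passed: bool


@dataclass
class GateDecision:
    promote: bool
    blockers: list = field(default_factory=list)   # stop promotion
    incidents: list = field(default_factory=list)  # raise incident, track to closure
    trended: list = field(default_factory=list)    # drift dashboard only


def decide(results: list[AssertionResult]) -> GateDecision:
    """Route each failed assertion by its own severity, not by its case."""
    failures = [r for r in results if not r.passed]
    blockers = [r for r in failures if r.severity is Severity.BLOCKER]
    incidents = [r for r in failures if r.severity is Severity.REGRESSION]
    trended = [r for r in failures if r.severity is Severity.WARNING]
    return GateDecision(promote=not blockers, blockers=blockers,
                        incidents=incidents, trended=trended)


# Usage: the same case passes its blocker assertion but fails its warning.
results = [
    AssertionResult("suitability-001", "never_recommends_unsuitable_product",
                    Severity.BLOCKER, passed=True),
    AssertionResult("suitability-001", "fee_schedule_in_first_80_words",
                    Severity.WARNING, passed=False),
]
decision = decide(results)
```

In this usage the deployment proceeds and the warning is retained for trending, which is exactly the signal a single pass/fail verdict per case would discard.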
The 24-hour rule, in practice
Promotion to production requires a passing evaluation run within the last 24 hours, against the exact candidate version. The gate is enforced at the deployment system; evidence — a cryptographic reference to the evaluation result — is persisted to the audit trail at deploy time. Stale runs do not qualify, and runs against a different commit do not qualify.
Why 24 hours specifically? Shorter windows penalise small teams who legitimately cannot re-run a full harness on every push. Longer windows let yesterday's evaluation certify today's regressed code. 24 hours is the operationally honest middle. We have run this rule at four customer firms and have not yet encountered a case where it became a practical bottleneck once harness runtime was budgeted.
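A minimal sketch of the rule as a predicate, with one possible form of the evidence reference. The EvalRun fields are hypothetical, and using a SHA-256 digest over a canonical JSON rendering of the run is an assumption: the text says only that the reference is cryptographic.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)


@dataclass(frozen=True)
class EvalRun:
    run_id: str
    commit: str
    passed: bool
    finished_at: datetime  # expected to be timezone-aware (UTC)


def qualifies(run: EvalRun, candidate_commit: str, now: datetime) -> bool:
    """True only for a passing run, on the exact commit, no older than 24 hours."""
    age = now - run.finished_at
    return (
        run.passed
        and run.commit == candidate_commit
        and timedelta(0) <= age <= MAX_AGE
    )


def evidence_reference(run: EvalRun) -> str:
    """Content hash of the run record (assumed scheme: SHA-256 of canonical JSON)."""
    canonical = json.dumps(
        {
            "run_id": run.run_id,
            "commit": run.commit,
            "passed": run.passed,
            "finished_at": run.finished_at.isoformat(),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Usage with fixed timestamps so the outcome is reproducible.
now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
fresh = EvalRun("run-1", "abc123", True, now - timedelta(hours=3))
stale = EvalRun("run-0", "abc123", True, now - timedelta(hours=30))
```

Here `qualifies(fresh, "abc123", now)` holds, while the stale run and any run against a different commit are rejected regardless of their pass status.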
Held-out cohorts and independence
Even a well-authored harness suffers from author bias. The engineering team that builds the agent, however diligent, sees the agent through a particular lens. Held-out cohorts counter that bias.
A held-out cohort is a sealed set of cases, authored by parties independent of the engineering team, used exclusively for pre-deployment certification. The cohort is not visible to the engineering team during development; access to it is itself logged. Cohorts are rotated quarterly with overlap windows so that drift in evaluation difficulty can be measured.
In financial services we have authored cohorts jointly with second-line risk and compliance. In digital health, cohorts have been authored with clinical safety officers. The pattern adapts; the independence property does not.
Drift, regression and the trend dashboard
A single evaluation run is a snapshot. The interesting object is the time series. Agent behaviour drifts because models are updated, retrieval indexes are refreshed, customer cohorts shift, upstream tools change. The trend dashboard plots assertion pass-rates over time, broken down by severity and category.
Behavioural drift is more actionable than embedding-distance drift. A 0.12 cosine shift on yesterday's customer cohort is not a number an operator can take to a risk committee. 'The fee-disclosure assertion fell from 99% to 92% over the last two weeks' is.
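To make the behavioural framing concrete, the sketch below turns raw assertion results into a per-assertion daily pass-rate series and flags assertions whose rate fell by more than a chosen margin across the window. The record shape, the function names and the five-point default margin are assumptions for illustration; comparing only the first and last day is a deliberately crude stand-in for a proper trend test.

```python
from collections import defaultdict
from datetime import date


def pass_rates(records):
    """records: iterable of (day: date, assertion: str, passed: bool).

    Returns {assertion: {day: pass_rate}}.
    """
    counts = defaultdict(lambda: [0, 0])  # (assertion, day) -> [passes, total]
    for day, assertion, passed in records:
        bucket = counts[(assertion, day)]
        bucket[0] += int(passed)
        bucket[1] += 1
    series = defaultdict(dict)
    for (assertion, day), (ok, total) in counts.items():
        series[assertion][day] = ok / total
    return dict(series)


def flag_drops(series, min_drop=0.05):
    """Return {assertion: (first_rate, last_rate)} where the drop is at least min_drop."""
    flagged = {}
    for assertion, by_day in series.items():
        days = sorted(by_day)
        first, last = by_day[days[0]], by_day[days[-1]]
        if first - last >= min_drop:
            flagged[assertion] = (first, last)
    return flagged
```

The output is phrased in the same terms as the assertion itself, so an entry such as `{"fee_disclosure": (0.99, 0.92)}` can go to a risk committee without translation.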
Integrating with CI and the deployment system
Harnesses run as a step in CI. On pull request, only the cases tagged 'fast' run, which typically takes a few minutes. On merge to main, the full harness runs. On release candidate, the held-out cohort is run by a service account with cohort-read access; engineering does not see the individual case results, only the aggregate pass-rate and severity breakdown.
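The stage-dependent selection and the aggregate-only view of held-out results could look like the sketch below. The dictionary shapes and names are hypothetical. Whether the full harness also runs at release candidate is not stated above, so this sketch returns only the held-out cohort for that stage.

```python
from collections import Counter
from enum import Enum


class Stage(Enum):
    PULL_REQUEST = "pull_request"
    MERGE_TO_MAIN = "merge_to_main"
    RELEASE_CANDIDATE = "release_candidate"


def select_cases(stage, harness_cases, held_out_cases):
    """Pick the cases to run for a CI stage.

    Each case is assumed to be a dict with a 'tags' collection.
    """
    if stage is Stage.PULL_REQUEST:
        return [c for c in harness_cases if "fast" in c["tags"]]
    if stage is Stage.MERGE_TO_MAIN:
        return list(harness_cases)
    return list(held_out_cases)


def aggregate(results):
    """Summarise results without exposing case identifiers.

    Each result is assumed to be a dict with 'passed' (bool) and 'severity' (str).
    """
    total = len(results)
    passed = sum(r["passed"] for r in results)
    failures = Counter(r["severity"] for r in results if not r["passed"])
    return {
        "pass_rate": passed / total if total else None,
        "failures_by_severity": dict(failures),
    }
```

Only the output of `aggregate` would be shown to engineering for a held-out run; the per-case results stay with the service account.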
The deployment system queries the evaluation service before promotion: 'is there a passing run within the last 24 hours for this exact commit hash?'. The answer is binary. The reference to the run is recorded in the audit trail at the moment of deployment.
Reporting evidence to second-line and regulators
Evaluation evidence rolls up into the audit trail as a structured record. Sub-period reports map every assertion failure to the policy framework it implicates: FCA Consumer Duty, ICO AI Auditing Framework, MHRA post-market surveillance, EU AI Act high-risk obligations, UK AI Action Plan accountability expectations.
Regulators do not want raw evaluation logs. They want to know: which framework obligations were the bar set against? Which deployments were certified against which version of the harness? Where assertions failed, what happened next? AgentAudit generates these reports deterministically from the audit trail; every line item carries a verifiable reference to its underlying evaluation result.
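A report of this kind is essentially a deterministic grouping of audit-trail records. The sketch below groups assertion failures by the framework they implicate and sorts every level so that the same input always yields the same output. The record fields ('assertion', 'frameworks', 'harness_version', 'result_ref') are hypothetical and do not describe AgentAudit's actual record format.

```python
from collections import defaultdict


def failures_by_framework(failures):
    """Group failure records by framework, with stable ordering.

    Each failure is assumed to be a dict with keys 'assertion',
    'frameworks' (list of str), 'harness_version' and 'result_ref'.
    """
    grouped = defaultdict(list)
    for failure in failures:
        for framework in failure["frameworks"]:
            grouped[framework].append(
                {
                    "assertion": failure["assertion"],
                    "harness_version": failure["harness_version"],
                    "result_ref": failure["result_ref"],
                }
            )
    # Sort frameworks and line items so repeated runs produce identical reports.
    return {
        framework: sorted(items, key=lambda i: (i["assertion"], i["result_ref"]))
        for framework, items in sorted(grouped.items())
    }
```

Each line item keeps its result reference, so a reader of the report can trace any entry back to the underlying evaluation result.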
Operating the methodology at estate scale
Once an organisation runs more than a handful of agents, evaluation becomes an estate problem rather than a per-agent problem. Common harnesses are shared across agents in the same business unit. Severity policies are set centrally by second-line. Drift dashboards are aggregated to the executive level.
The methodology is recursive: the harnesses themselves are evaluated. We track harness coverage (what fraction of the agent's behaviour space is exercised), oracle calibration (how often the model-graded oracle agrees with the human reviewer), and cohort decay (how the pass-rate against a fixed cohort changes over time). These meta-metrics are how the governance function knows the evaluation function is working.
- Coverage: % of behaviour space exercised by the harness
- Calibration: agreement between model-graded oracle and human review
- Decay: pass-rate trend on a fixed cohort, used to schedule cohort rotation
- Latency: end-to-end harness runtime — a budget, not an afterthought
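Two of these meta-metrics lend themselves to a short sketch: decay as the least-squares slope of pass-rate over successive runs on a fixed cohort, and latency as a simple budget check. Both function names, the slope threshold and the idea of triggering rotation on the magnitude of the slope in either direction are assumptions of this example, not a description of how rotation is actually scheduled.

```python
def decay_slope(pass_rates: list[float]) -> float:
    """Least-squares slope of pass-rate against run index (0, 1, 2, ...)."""
    n = len(pass_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(pass_rates) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(pass_rates))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def rotation_due(pass_rates: list[float], max_abs_slope: float = 0.01) -> bool:
    """Flag a cohort whose pass-rate is trending in either direction (assumed policy)."""
    return abs(decay_slope(pass_rates)) >= max_abs_slope


def within_budget(runtime_seconds: float, budget_seconds: float) -> bool:
    """Treat harness runtime as a budget that a run either meets or exceeds."""
    return runtime_seconds <= budget_seconds
```

A steadily rising pass-rate on a fixed cohort, for instance `[0.90, 0.92, 0.94, 0.96]`, gives a slope of roughly 0.02 per run and would be flagged under the default threshold used here.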
Takeaway
Evaluation is a production engineering discipline, not a one-off benchmark. The harness is the unit of governance.