Methodology paper

An evaluation methodology for production agent estates

Versioned harnesses, regression discipline and severity-gated deployment — the methodology behind AgentAudit's evaluation module, distilled from production engineering at Turing.

12 April 2026 · Miracle Alex, AgentAudit Research · 32 pages

Contents

Why production evaluation differs from research benchmarks
The unit of evaluation: harnesses, not benchmarks
Anatomy of a harness — cases, oracles, severities
Authoring cases that survive contact with production
Oracles: deterministic, model-graded, human
Severity grading and the deployment gate
The 24-hour rule, in practice
Held-out cohorts and independence
Drift, regression and the trend dashboard
Integrating with CI and the deployment system
Reporting evidence to second-line and regulators
Operating the methodology at estate scale

Why production evaluation differs from research benchmarks

Research benchmarks measure capability ceilings on static datasets and answer a single question: how well can the best version of this model do on this task, under ideal conditions, today? That is a useful question for model selection. It is the wrong question for a regulated firm operating a fleet of agents inside live customer journeys.

Production evaluation answers a different question: is this specific agent, on this specific prompt graph, against this specific tool surface, still safe and effective to keep running for the customers in front of it this week? The dataset is not static — it is the customer cohort that arrived overnight. The system under test is not the model — it is the agent, its scaffolding, its instructions, its tools, its retrieval index, and the upstream and downstream services it depends on, all of which drift independently.

This shift has three operational consequences. First, evaluation must be cheap enough to run on every change, not just at release. Second, it must be severity-aware: not every assertion failure should block a deploy, or engineering velocity dies. Third, the evidence it produces must be defensible to a second-line risk function and, ultimately, to a regulator — which means it has to be reproducible, versioned and trustworthy long after the engineer who wrote it has rotated off the team.

The unit of evaluation: harnesses, not benchmarks

In AgentAudit, the unit of evaluation is the harness: a versioned bundle of cases, oracles and severity rules attached to a specific agent. Harnesses are first-class artefacts. They have owners, version histories, change-review and an audit trail. They are deployed alongside the agent they evaluate, and they evolve in lock-step with it.

Treating the harness as the unit, rather than the benchmark score, is the single most important architectural decision in production evaluation. It moves evaluation from 'a number we look at on Friday' to 'the contract that decides whether code ships'. It also makes governance tractable: the second-line team reviews and approves harnesses, not raw model outputs.

Anatomy of a harness — cases, oracles, severities

A harness is composed of three primitives. Cases describe inputs to the agent and the assertions about its behaviour that must hold. Oracles are the mechanisms that evaluate those assertions. Severity rules decide what happens when an assertion fails: blocker, regression, or warning.

Every primitive is versioned independently. A change to an oracle's threshold is a reviewable event. A change to a case is reviewable. A change to severity is reviewable. The audit trail therefore answers not just 'did the agent pass?' but 'was the bar at the right height when it passed?'.

Cases: input scenarios + behavioural assertions (not just expected output strings)
Oracles: the evaluation mechanism — deterministic first, model-graded second, human third
Severities: blocker, regression, warning — each mapped to a deployment-gate decision
Metadata: owner, framework tags, last-changed-by, last-reviewed-at
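As an illustration only, the three primitives and their metadata could be modelled roughly as follows. The class and field names here (`Assertion`, `Case`, `Harness`, `run_case`) are hypothetical and are not taken from AgentAudit's actual schema; the oracle is reduced to a plain callable for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Severity(Enum):
    BLOCKER = "blocker"
    REGRESSION = "regression"
    WARNING = "warning"


@dataclass(frozen=True)
class Assertion:
    name: str
    severity: Severity             # set per assertion, not per case
    oracle: Callable[[str], bool]  # returns True when the assertion holds


@dataclass(frozen=True)
class Case:
    case_id: str
    agent_input: str
    assertions: tuple[Assertion, ...]


@dataclass(frozen=True)
class Harness:
    agent: str
    version: str
    owner: str
    framework_tags: tuple[str, ...]
    cases: tuple[Case, ...]


def run_case(case: Case, agent_output: str) -> dict[str, bool]:
    """Evaluate every assertion of one case against one agent output."""
    return {a.name: a.oracle(agent_output) for a in case.assertions}


# Hypothetical usage: one case with a single warning-level assertion.
case = Case(
    case_id="fees-001",
    agent_input="Which savings account should I open?",
    assertions=(
        Assertion("mentions_fees", Severity.WARNING,
                  lambda out: "fee" in out.lower()),
    ),
)
results = run_case(case, "This account has no monthly fee.")
```

Keeping the dataclasses frozen is one way to make "a change is a reviewable event" concrete: an edited case or threshold has to be a new object, and therefore a new version.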

Authoring cases that survive contact with production

The temptation, when first building a harness, is to write cases that exercise the happy path. These cases are nearly worthless. They will pass in perpetuity, regardless of how badly the agent regresses, because the happy path is the one the model is most heavily optimised for.

Useful cases come from four sources: real production incidents (the agent did the wrong thing — write the case that would have caught it), customer complaints (a human told us we did the wrong thing — encode their expectation), adversarial scenarios authored by a red team, and policy-derived cases (a rule we are required to comply with — write the case that proves we comply).

We recommend that at least 60% of cases come from sources outside the engineering team. If 100% of cases are authored by the engineers who built the agent, the harness is circular: they will instinctively write cases the agent passes.
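The 60% recommendation can be checked mechanically if each case records who authored it. The sketch below assumes a hypothetical `authored_by` label per case, with `"engineering"` marking the team that built the agent; neither the label nor its values come from the methodology itself.

```python
MIN_EXTERNAL_SHARE = 0.60  # the recommended minimum from the text above


def external_share(authors: list[str]) -> float:
    """Fraction of cases authored outside the engineering team."""
    if not authors:
        raise ValueError("harness has no cases")
    external = sum(1 for a in authors if a != "engineering")
    return external / len(authors)


# Hypothetical harness of ten cases.
authors = ["engineering"] * 4 + ["red_team"] * 3 + ["compliance"] * 3
share = external_share(authors)
meets_recommendation = share >= MIN_EXTERNAL_SHARE
```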

Oracles: deterministic, model-graded, human

Oracles are arranged in a hierarchy. Deterministic checks come first because they are cheap, repeatable and not subject to grader drift. They cover everything that can be checked by a regular expression, a structured-output schema, a tool-call inspection or a numeric tolerance. Most safety-critical assertions can be expressed deterministically with enough care.
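A minimal sketch of the four kinds of deterministic check named above, using only the Python standard library. The function names, the shape of a tool call (a dict with a `"name"` key) and the default tolerance are assumptions made for illustration.

```python
import math
import re


def matches_pattern(output: str, pattern: str) -> bool:
    """Regular-expression check on the agent's text output."""
    return re.search(pattern, output) is not None


def has_required_fields(payload: dict, required: dict[str, type]) -> bool:
    """Structured-output check: every required key is present with the right type."""
    return all(isinstance(payload.get(key), typ) for key, typ in required.items())


def called_tool(tool_calls: list[dict], tool_name: str) -> bool:
    """Tool-call inspection: was the named tool invoked at least once?"""
    return any(call.get("name") == tool_name for call in tool_calls)


def within_tolerance(actual: float, expected: float, rel_tol: float = 1e-3) -> bool:
    """Numeric check with a relative tolerance."""
    return math.isclose(actual, expected, rel_tol=rel_tol)
```

Each function returns a plain boolean, so the same inputs always give the same verdict and there is no grader to drift.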

Model-graded checks come second, for assertions that require natural-language judgement — tone, disclosure adequacy, completeness. Model-graded oracles must themselves be evaluated and versioned. We require a calibration set of human-graded examples for every model-graded oracle, and we monitor inter-grader agreement when we change the grading model.
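One way to quantify agreement between a model-graded oracle and its human-graded calibration set is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. Kappa is our choice for this sketch; the text above does not prescribe a particular statistic, and the labels below are invented.

```python
def agreement_rate(model_labels: list, human_labels: list) -> float:
    """Fraction of calibration examples where the two graders agree."""
    if len(model_labels) != len(human_labels) or not model_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(model_labels)


def cohens_kappa(model_labels: list, human_labels: list) -> float:
    """Chance-corrected agreement between the two graders."""
    n = len(model_labels)
    observed = agreement_rate(model_labels, human_labels)
    labels = set(model_labels) | set(human_labels)
    expected = sum(
        (model_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:  # both graders always emit the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical calibration set of four examples (True = assertion holds).
model = [True, True, False, False]
human = [True, True, False, True]
rate = agreement_rate(model, human)
kappa = cohens_kappa(model, human)
```

Recomputing both numbers against the same calibration set whenever the grading model changes gives a like-for-like comparison between grader versions.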

Human review comes third, for the highest-severity decisions and for a continuous calibration sample. Human review is expensive; we use it sparingly and we always record it.

Severity grading and the deployment gate

Severity is the lever that lets evaluation be both strict and survivable. A blocker failure stops promotion to production; the only way past it is a documented variance. A regression failure raises an incident, notifies the agent owner, and is tracked to closure, but it does not block deployment. A warning is trended on the drift dashboard but triggers no immediate action.

Severity is set per assertion, not per case. A single case can have a blocker assertion (we never recommend an unsuitable product) and a warning assertion (the recommendation should mention the fee schedule in the first 80 words). Reducing every case to pass/fail throws away the only signal that lets engineering teams operate without burnout.
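The mapping from failed assertions to a gate decision can be sketched in a few lines. `GateDecision` and `decide` are hypothetical names; the documented-variance override is deliberately left out of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKER = "blocker"
    REGRESSION = "regression"
    WARNING = "warning"


@dataclass(frozen=True)
class GateDecision:
    block_promotion: bool   # any blocker failure stops the deploy
    raise_incident: bool    # any regression failure opens an incident
    warnings_to_trend: int  # warnings only feed the drift dashboard


def decide(failed: list[Severity]) -> GateDecision:
    """Map the severities of all failed assertions to one gate decision."""
    return GateDecision(
        block_promotion=Severity.BLOCKER in failed,
        raise_incident=Severity.REGRESSION in failed,
        warnings_to_trend=failed.count(Severity.WARNING),
    )


# A run with one regression and one warning: deploy proceeds, incident is raised.
decision = decide([Severity.REGRESSION, Severity.WARNING])
```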

The 24-hour rule, in practice

Promotion to production requires a passing evaluation run within the last 24 hours, against the exact candidate version. The gate is enforced at the deployment system; evidence — a cryptographic reference to the evaluation result — is persisted to the audit trail at deploy time. Stale runs do not qualify, and runs against a different commit do not qualify.
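The freshness and exact-version conditions reduce to a small predicate. This sketch assumes a hypothetical `EvalRun` record with timezone-aware UTC timestamps; how runs are stored and fetched is out of scope here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)


@dataclass(frozen=True)
class EvalRun:
    commit: str
    passed: bool
    finished_at: datetime  # timezone-aware, UTC


def may_promote(runs: list[EvalRun], candidate_commit: str, now: datetime) -> bool:
    """True only if a passing run exists for this exact commit within 24 hours."""
    return any(
        run.passed
        and run.commit == candidate_commit
        and timedelta(0) <= now - run.finished_at <= MAX_AGE
        for run in runs
    )


now = datetime(2026, 4, 12, 12, 0, tzinfo=timezone.utc)
runs = [
    EvalRun("abc123", True, now - timedelta(hours=3)),   # fresh, right commit
    EvalRun("abc123", True, now - timedelta(hours=30)),  # stale
    EvalRun("def456", True, now - timedelta(hours=1)),   # different commit
]
allowed = may_promote(runs, "abc123", now)
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check reproducible when the audit trail is replayed later.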

Why 24 hours specifically? Shorter windows penalise small teams who legitimately cannot re-run a full harness on every push. Longer windows let yesterday's evaluation certify today's regressed code. 24 hours is the operationally honest middle. We have run this rule at four customer firms and have not yet encountered a case where it became a practical bottleneck once harness runtime was budgeted.

Held-out cohorts and independence

Even a well-authored harness suffers from author bias. The engineering team that builds the agent, however diligent, sees the agent through a particular lens. Held-out cohorts close that loop.

A held-out cohort is a sealed set of cases, authored by parties independent of the engineering team, used exclusively for pre-deployment certification. The cohort is not visible to the engineering team during development; access to it is itself logged. Cohorts are rotated quarterly with overlap windows so that drift in evaluation difficulty can be measured.

In financial services we have authored cohorts jointly with second-line risk and compliance. In digital health, cohorts have been authored with clinical safety officers. The pattern adapts; the independence property does not.

Drift, regression and the trend dashboard

A single evaluation run is a snapshot. The interesting object is the time series. Agent behaviour drifts because models are updated, retrieval indexes are refreshed, customer cohorts shift, and upstream tools change. The trend dashboard plots assertion pass-rates over time, broken down by severity and category.

Behavioural drift is more actionable than embedding-distance drift. A 0.12 cosine shift on yesterday's customer cohort is not a number an operator can take to a risk committee. 'The fee-disclosure assertion fell from 99% to 92% over the last two weeks' is.
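A behavioural drift signal of this kind can be derived directly from per-assertion results. The record shape (day, assertion name, passed), the helper names and the 5-point alert threshold are all assumptions for illustration; days are ISO date strings so that they sort chronologically.

```python
from collections import defaultdict


def daily_pass_rates(records: list[tuple[str, str, bool]]) -> dict[tuple[str, str], float]:
    """Pass-rate per (assertion, day) from (day, assertion, passed) records."""
    totals = defaultdict(lambda: [0, 0])  # [passes, runs]
    for day, assertion, passed in records:
        bucket = totals[(assertion, day)]
        bucket[0] += int(passed)
        bucket[1] += 1
    return {key: passes / runs for key, (passes, runs) in totals.items()}


def series_for(rates: dict[tuple[str, str], float], assertion: str) -> list[float]:
    """Chronological pass-rate series for one assertion."""
    days = sorted(day for (name, day) in rates if name == assertion)
    return [rates[(assertion, day)] for day in days]


def has_dropped(series: list[float], max_drop: float = 0.05) -> bool:
    """Flag when the latest pass-rate is more than max_drop below the first."""
    return len(series) >= 2 and series[0] - series[-1] > max_drop


records = [
    ("2026-04-01", "fee_disclosure", True),
    ("2026-04-01", "fee_disclosure", True),
    ("2026-04-02", "fee_disclosure", True),
    ("2026-04-02", "fee_disclosure", False),
]
series = series_for(daily_pass_rates(records), "fee_disclosure")
alert = has_dropped(series)
```

Applied to the example in the text, `has_dropped([0.99, 0.92])` flags the fee-disclosure assertion under this threshold.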

Integrating with CI and the deployment system

Harnesses run as a step in CI. On pull request, only the cases tagged 'fast' run — typically a few minutes. On merge to main, the full harness runs. On release candidate, the held-out cohort is run by a service account with cohort-read access; engineering does not see the individual case results, only the aggregate pass-rate and severity breakdown.
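The stage-dependent selection and the aggregate-only report can be sketched as below. The stage names, the `held_out` flag and the result shape are hypothetical; in practice the held-out cohort would live in a separate, access-controlled store rather than as a flag on the case.

```python
from collections import Counter


def select_cases(cases: list[dict], stage: str) -> list[dict]:
    """Pick the cases to run for a given CI stage."""
    if stage == "pull_request":
        return [c for c in cases if "fast" in c["tags"] and not c["held_out"]]
    if stage == "main":
        return [c for c in cases if not c["held_out"]]
    if stage == "release_candidate":
        return [c for c in cases if c["held_out"]]
    raise ValueError(f"unknown stage: {stage}")


def aggregate_report(results: list[dict]) -> dict:
    """Summarise results without exposing individual case identifiers."""
    failures = Counter(r["severity"] for r in results if not r["passed"])
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return {"pass_rate": pass_rate, "failures_by_severity": dict(failures)}


cases = [
    {"id": "c1", "tags": ["fast"], "held_out": False},
    {"id": "c2", "tags": [], "held_out": False},
    {"id": "c3", "tags": [], "held_out": True},
]
pr_ids = [c["id"] for c in select_cases(cases, "pull_request")]
report = aggregate_report([
    {"case_id": "c3", "severity": "warning", "passed": False},
    {"case_id": "c3", "severity": "blocker", "passed": True},
])
```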

The deployment system queries the evaluation service before promotion: 'is there a passing run within the last 24 hours for this exact commit hash?'. The answer is binary. The reference to the run is recorded in the audit trail at the moment of deployment.
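The reference recorded at deploy time could be as simple as a content hash of the evaluation result. The sketch below uses SHA-256 over a canonical JSON encoding; that choice, the `sha256:` prefix and the field names are assumptions, since the text does not specify the scheme.

```python
import hashlib
import json


def evidence_reference(result: dict) -> str:
    """Content-addressed reference: SHA-256 over a canonical JSON encoding."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical evaluation result; key order must not affect the reference.
result_a = {"commit": "abc123", "harness_version": "1.4.0", "passed": True}
result_b = {"passed": True, "harness_version": "1.4.0", "commit": "abc123"}
ref_a = evidence_reference(result_a)
ref_b = evidence_reference(result_b)
```

Sorting keys before hashing means two records with the same content always yield the same reference, so an auditor can recompute it from the stored result and compare.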

Reporting evidence to second-line and regulators

Evaluation evidence rolls up into the audit trail as a structured record. Sub-period reports map every assertion failure to the policy framework it implicates: FCA Consumer Duty, ICO AI Auditing Framework, MHRA post-market surveillance, EU AI Act high-risk obligations, UK AI Action Plan accountability expectations.

Regulators do not want raw evaluation logs. They want to know: which framework obligations were the bar set against? Which deployments were certified against which version of the harness? Where assertions failed, what happened next? AgentAudit generates these reports deterministically from the audit trail; every line item carries a verifiable reference to its underlying evaluation result.
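As a rough sketch of the roll-up, failures carrying framework tags can be grouped per framework with deterministic ordering. The record shape and function name are hypothetical, and the example data is invented.

```python
from collections import defaultdict


def failures_by_framework(failures: list[dict]) -> dict[str, list[str]]:
    """Group failed assertions under every framework tag they carry."""
    grouped = defaultdict(list)
    for failure in failures:
        for tag in failure["framework_tags"]:
            grouped[tag].append(failure["assertion"])
    # Sort keys and values so the same audit trail always yields the same report.
    return {tag: sorted(names) for tag, names in sorted(grouped.items())}


failures = [
    {"assertion": "fee_disclosure", "framework_tags": ["FCA Consumer Duty"]},
    {"assertion": "unsuitable_product",
     "framework_tags": ["FCA Consumer Duty", "EU AI Act"]},
]
report = failures_by_framework(failures)
```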

Operating the methodology at estate scale

Once an organisation runs more than a handful of agents, evaluation becomes an estate problem rather than a per-agent problem. Common harnesses are shared across agents in the same business unit. Severity policies are set centrally by second-line. Drift dashboards are aggregated to the executive level.

The methodology is recursive: the harnesses themselves are evaluated. We track harness coverage (what fraction of the agent's behaviour space is exercised), oracle calibration (how often the model-graded oracle agrees with the human reviewer), and cohort decay (how the pass-rate against a fixed cohort changes over time). These meta-metrics are how the governance function knows the evaluation function is working.

Coverage: % of behaviour space exercised by the harness
Calibration: agreement between model-graded oracle and human review
Decay: pass-rate trend on a fixed cohort, used to schedule cohort rotation
Latency: end-to-end harness runtime — a budget, not an afterthought

Takeaway

Evaluation is a production engineering discipline, not a one-off benchmark. The harness is the unit of governance.