Methodology paper

Held-out validation cohorts for pre-deployment certification

Why production deployment certification needs sealed cohorts, how to author them, and how to audit cohort integrity over a deployment lifecycle.

04 June 2026 · AgentAudit Research · 16 pages

Contents

The certification problem
Why held-out cohorts
Authoring discipline
Sealing and access control
Cohort rotation and overlap
Measuring cohort decay
Patterns from financial services, digital health and public sector

The certification problem

If the engineering team that builds an agent also authors the cases used to certify it for production, the certification is circular. The engineers know — implicitly, and often unconsciously — which scenarios their agent handles well, and they write cases that exercise those scenarios.

The result is a high pass-rate that does not generalise. The agent is certified against the cases the engineers thought to write; the cases the engineers did not think to write are exactly the ones the agent fails on in production. Held-out cohorts break this loop.

Why held-out cohorts

A held-out cohort is a sealed set of cases, authored by parties independent of the engineering team, used exclusively for pre-deployment certification. The engineering team does not see the individual cases. Access to the cohort is itself logged. The cohort's aggregate pass-rate is the certification signal; the per-case detail is reserved for the cohort owner.

The independence property is what makes the certification meaningful. Without it, the cohort is just another harness and inherits the same author-bias.

Authoring discipline

Cohort cases come from four sources. Real production incidents: each time the agent does the wrong thing in production, the cohort gains a case. Customer complaints: each time a person reports that the agent did the wrong thing, the cohort gains a case. Adversarial scenarios: authored by a red team explicitly trying to make the agent fail its safety assertions. Policy-derived cases: authored by second-line teams to evidence compliance with specific rules.

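We recommend a minimum cohort size of 200 cases, distributed roughly 30/30/20/20 across these four sources. Smaller cohorts produce statistically noisy certification signals: at a true pass-rate of 90%, the standard error of the measured pass-rate is about 2 percentage points with 200 cases but about 4 with 50. Larger cohorts are valuable but expensive to maintain.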
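The sizing argument can be made concrete with a short sketch. The code below is illustrative only: the source labels, the function names, and the choice of a Wilson score interval are assumptions made for this example, not part of the methodology, and the 30/30/20/20 split is assumed to follow the order in which the four sources are listed.

```python
import math

# Hypothetical labels for the four sources; the shares follow the
# 30/30/20/20 split, assumed to be in the order the sources are listed.
SOURCE_SHARES = {
    "production_incident": 0.30,
    "customer_complaint": 0.30,
    "adversarial": 0.20,
    "policy_derived": 0.20,
}


def source_quotas(cohort_size: int) -> dict:
    """Split a cohort size across the four sources in whole cases."""
    quotas = {name: round(cohort_size * share)
              for name, share in SOURCE_SHARES.items()}
    # Put any rounding remainder on the first source so the total is exact.
    first = next(iter(quotas))
    quotas[first] += cohort_size - sum(quotas.values())
    return quotas


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval (about 95% for z = 1.96) for a pass-rate passes/n."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    print(source_quotas(200))
    print(wilson_interval(180, 200))  # 90% observed pass-rate, 200 cases
    print(wilson_interval(45, 50))    # 90% observed pass-rate, 50 cases
```

Under these assumptions, 180 passes out of 200 gives an interval of roughly 0.85 to 0.93, while 45 passes out of 50 gives an interval about twice as wide. That difference is the noise that a small cohort adds to the certification signal.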

Sealing and access control

Cohorts are sealed in two senses. First, the cases themselves are not visible to the engineering team — they live in a separate datastore with access controlled by the cohort owner. Second, access to the cohort is logged: any service account that runs the cohort, any human who reviews case-level detail, is recorded.

The log of cohort access is itself an audit artefact. Regulators reviewing an AI agent's certification can ask who has seen these cases, and the answer is recoverable from the log.
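The two senses of sealing can be sketched as a small in-memory wrapper. Everything here is hypothetical: the class and method names, the exact-match pass criterion, and the rule that only the cohort owner may review case-level detail are assumptions for illustration. A real deployment would keep the cases in a separate datastore and write the log to tamper-evident storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class AccessEvent:
    timestamp: datetime
    principal: str           # service account or human identifier
    action: str              # "run", "review_case" or "review_denied"
    case_id: Optional[str]   # None for whole-cohort runs


class SealedCohort:
    """Hypothetical sealed cohort: aggregate results out, every access logged."""

    def __init__(self, cases: dict, owner: str) -> None:
        if not cases:
            raise ValueError("a cohort needs at least one case")
        self._cases = cases
        self._owner = owner
        self._log = []

    def _record(self, principal: str, action: str,
                case_id: Optional[str] = None) -> None:
        self._log.append(
            AccessEvent(datetime.now(timezone.utc), principal, action, case_id))

    def run(self, principal: str, agent: Callable[[str], str]) -> float:
        """Run every case and return only the aggregate pass-rate."""
        self._record(principal, "run")
        # Assumption: a case passes when the agent output equals the expected value.
        passed = sum(1 for case in self._cases.values()
                     if agent(case["input"]) == case["expected"])
        return passed / len(self._cases)

    def review_case(self, principal: str, case_id: str) -> dict:
        """Return case-level detail to the cohort owner only; log every attempt."""
        if principal != self._owner:
            self._record(principal, "review_denied", case_id)
            raise PermissionError(f"{principal} may not view case detail")
        self._record(principal, "review_case", case_id)
        return dict(self._cases[case_id])

    def who_has_seen(self, case_id: str) -> set:
        """Principals that successfully viewed the detail of one case."""
        return {e.principal for e in self._log
                if e.action == "review_case" and e.case_id == case_id}

    def access_log(self) -> tuple:
        """Read-only copy of the access log."""
        return tuple(self._log)
```

The design choice worth noting is that `run` returns a single number. The engineering team's service account can trigger certification without ever receiving a case, and denied review attempts are logged alongside successful ones.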

Cohort rotation and overlap

Cohorts decay. The longer a cohort is used, the more the engineering team — through indirect signals like aggregate pass-rates and remediation actions — learns its general shape. We recommend quarterly rotation, with a two-week overlap window during which both the retiring and incoming cohort run side-by-side.

The overlap window does two things. It quantifies cohort difficulty drift (if the new cohort's pass-rate differs sharply from the retiring one's, the difference must be explained). And it provides operational cover for the cohort owner to retire and replace cases without breaking the certification gate.
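One way to decide whether the two cohorts differ sharply during the overlap window is a two-proportion z-test on their aggregate pass-rates. This is a sketch under stated assumptions: the paper does not define "sharply", so the test itself, the function names, and the 1.96 threshold are illustrative choices rather than a prescribed method.

```python
import math


def pass_rate_gap(retiring_passes: int, retiring_n: int,
                  incoming_passes: int, incoming_n: int) -> tuple:
    """Return (incoming minus retiring pass-rate, two-proportion z statistic)."""
    if retiring_n <= 0 or incoming_n <= 0:
        raise ValueError("both cohorts need at least one case")
    p_retiring = retiring_passes / retiring_n
    p_incoming = incoming_passes / incoming_n
    # Pooled pass-rate under the hypothesis that both cohorts are equally hard.
    pooled = (retiring_passes + incoming_passes) / (retiring_n + incoming_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / retiring_n + 1 / incoming_n))
    gap = p_incoming - p_retiring
    z = gap / se if se > 0 else 0.0
    return gap, z


def needs_explanation(retiring_passes: int, retiring_n: int,
                      incoming_passes: int, incoming_n: int,
                      z_threshold: float = 1.96) -> bool:
    """Flag the rotation when the gap is unlikely to be sampling noise.

    The threshold is an assumed default, not a value from the methodology.
    """
    _, z = pass_rate_gap(retiring_passes, retiring_n,
                         incoming_passes, incoming_n)
    return abs(z) >= z_threshold
```

For example, with 200 cases in each cohort, a drop from 184 passes to 160 is flagged for explanation, while a drop from 184 to 180 is within ordinary sampling noise under this test.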

Measuring cohort decay

We track three decay signals. Pass-rate creep: the trend of aggregate pass-rate against a fixed cohort over time. Sharp upward creep without corresponding agent improvements is a leakage signal. Case-difficulty distribution: the spread of pass-rates across individual cases — a cohort whose cases are all easy or all hard is providing little signal. And remediation correlation: the rate at which cohort failures lead to engineering remediation tickets — a healthy cohort produces actionable failures.
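The three signals can each be reduced to a single number. The sketch below is one possible operationalisation, and every definition in it is an assumption: creep as a least-squares slope per certification run, difficulty spread as the population standard deviation of per-case pass-rates, and remediation correlation as the share of failed cases linked to a ticket. The function names and case identifiers are hypothetical.

```python
import statistics


def pass_rate_creep(rates: list) -> float:
    """Least-squares slope of aggregate pass-rate per run against a fixed cohort."""
    n = len(rates)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(rates))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def difficulty_spread(case_pass_rates: list) -> float:
    """Population standard deviation of per-case pass-rates.

    A value near zero means the cases are all about equally easy or hard.
    """
    return statistics.pstdev(case_pass_rates)


def remediation_rate(failed_case_ids: set, ticketed_case_ids: set) -> float:
    """Share of failed cases that led to at least one remediation ticket."""
    if not failed_case_ids:
        return 0.0
    return len(failed_case_ids & ticketed_case_ids) / len(failed_case_ids)
```

A positive slope from `pass_rate_creep` is only a leakage signal when it is not matched by known agent improvements, so the number needs to be read alongside the release history rather than on its own.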

Patterns from financial services, digital health and public sector

In financial services, cohort authors are typically second-line risk and compliance, with cases drawn from suitability-review files and customer complaints. In digital health, cohort authors are clinical safety officers, with cases drawn from incident reports and adverse-event records. In public sector, cohort authors are policy teams, with cases drawn from case-handling appeals and ombudsman referrals.

The roles differ; the independence property is the same. Once a firm has experienced a certification gate that the engineering team cannot trivially overfit to, it is hard to go back.

Takeaway

If the engineering team authors the certification cases, the certification is circular. Independence is the property; rotation and access logging are how you keep it.