Held-out validation cohorts for pre-deployment certification
Why production deployment certification needs sealed cohorts, how to author them, and how to audit cohort integrity over a deployment lifecycle.
The certification problem
If the engineering team that builds an agent also authors the cases used to certify it for production, the certification is circular. The engineers know — implicitly, and often unconsciously — which scenarios their agent handles well, and they write cases that exercise those scenarios.
The result is a high pass-rate that does not generalise. The agent is certified against the cases the engineers thought to write; the cases they did not think to write are disproportionately the ones the agent fails on in production. Held-out cohorts break this loop.
Why held-out cohorts
A held-out cohort is a sealed set of cases, authored by parties independent of the engineering team, used exclusively for pre-deployment certification. The engineering team does not see the individual cases. Access to the cohort is itself logged. The cohort's aggregate pass-rate is the certification signal; the per-case detail is reserved for the cohort owner.
The independence property is what makes the certification meaningful. Without it, the cohort is just another harness and inherits the same author-bias.
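As a minimal sketch of the aggregate-only signal described above, the cohort owner's side could collapse per-case results into a single record before anything is returned to the engineering team. The names `CaseResult`, `CertificationSignal` and `certification_signal`, and the 0.9 threshold, are illustrative assumptions rather than part of any particular certification framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str   # stays with the cohort owner, never returned to engineering
    passed: bool


@dataclass(frozen=True)
class CertificationSignal:
    cohort_id: str
    cases_run: int
    pass_rate: float
    certified: bool


def certification_signal(cohort_id, results, threshold=0.9):
    """Collapse per-case results into the aggregate the engineering team sees.

    The threshold is a hypothetical policy value chosen for illustration.
    """
    if not results:
        raise ValueError("cohort produced no results")
    passed = sum(1 for r in results if r.passed)
    rate = passed / len(results)
    return CertificationSignal(cohort_id, len(results), rate, rate >= threshold)
```

The returned record carries no case identifiers, so sharing it with engineering reveals only the aggregate pass-rate and the gate decision.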
Sealing and access control
Cohorts are sealed in two senses. First, the cases themselves are not visible to the engineering team: they live in a separate datastore with access controlled by the cohort owner. Second, access to the cohort is logged: every service account that runs the cohort and every human who reviews case-level detail is recorded.
The log of cohort access is itself an audit artefact. A regulator reviewing an AI agent's certification can ask who has seen these cases, and the log provides the answer.
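One way to picture the access log is as an append-only list of events that can be queried per cohort. The sketch below is an in-memory illustration only; the class name `CohortAccessLog`, the action labels `run_cohort` and `view_case_detail`, and the field names are assumptions, and a real deployment would write to tamper-evident storage rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessEvent:
    principal: str   # service account or human reviewer
    kind: str        # "service" or "human"
    action: str      # hypothetical labels: "run_cohort" or "view_case_detail"
    cohort_id: str
    at: datetime


class CohortAccessLog:
    """Append-only record of who touched a cohort (in-memory sketch)."""

    def __init__(self):
        self._events = []

    def record(self, principal, kind, action, cohort_id, at=None):
        event = AccessEvent(
            principal, kind, action, cohort_id,
            at or datetime.now(timezone.utc),
        )
        self._events.append(event)
        return event

    def who_has_seen(self, cohort_id):
        """Principals who viewed case-level detail for a cohort, sorted."""
        return sorted({
            e.principal for e in self._events
            if e.cohort_id == cohort_id and e.action == "view_case_detail"
        })

    def events_for(self, cohort_id):
        """Every recorded access to a cohort, in the order it was logged."""
        return [e for e in self._events if e.cohort_id == cohort_id]
```

Running the cohort and viewing case detail are logged as distinct actions, so the regulator's question about who has seen the cases can be answered separately from the question of which pipelines executed them.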
Cohort rotation and overlap
Cohorts decay. The longer a cohort is used, the more the engineering team learns its general shape through indirect signals such as aggregate pass-rates and remediation actions. We recommend quarterly rotation, with a two-week overlap window during which the retiring and incoming cohorts run side by side.
The overlap window does two things. It quantifies cohort difficulty drift (if the new cohort's pass-rate differs sharply from the retiring one's, the difference must be explained). And it provides operational cover for the cohort owner to retire and replace cases without breaking the certification gate.
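The drift check during the overlap window can be sketched as a comparison of two pass-rates. The example below uses a pooled two-proportion z-test, which is one common choice and not necessarily the method a given cohort owner uses; the function name `overlap_drift` and the 1.96 cut-off are assumptions made for illustration.

```python
import math


def overlap_drift(retiring_passed, retiring_total,
                  incoming_passed, incoming_total, z_threshold=1.96):
    """Compare retiring and incoming cohort pass-rates from the overlap window.

    Uses a pooled two-proportion z-test. The threshold is an illustrative
    default, not a recommended policy value.
    """
    p_old = retiring_passed / retiring_total
    p_new = incoming_passed / incoming_total
    pooled = (retiring_passed + incoming_passed) / (retiring_total + incoming_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / retiring_total + 1 / incoming_total))
    # se is zero only when every case passed (or failed) in both cohorts.
    z = 0.0 if se == 0 else (p_new - p_old) / se
    return {
        "difference": p_new - p_old,
        "z": z,
        "needs_explanation": abs(z) > z_threshold,
    }
```

For instance, a retiring cohort at 180 of 200 passes and an incoming cohort at 140 of 200 gives a difference of about -0.2 and a z-score of about -5, which would be flagged for explanation before the old cohort is retired.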
Measuring cohort decay
We track three decay signals. Pass-rate creep: the trend of aggregate pass-rate against a fixed cohort over time. Sharp upward creep without corresponding agent improvements is a leakage signal. Case-difficulty distribution: the spread of pass-rates across individual cases — a cohort whose cases are all easy or all hard is providing little signal. And remediation correlation: the rate at which cohort failures lead to engineering remediation tickets — a healthy cohort produces actionable failures.
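The three signals above can each be reduced to a single number. The following is a rough sketch under stated assumptions: creep is measured as a least-squares slope over successive runs, difficulty spread as the population standard deviation of per-case pass-rates, and remediation correlation as the share of failed cases linked to a ticket. All function names are hypothetical, and the computation is assumed to run on the cohort owner's side, since it needs case-level detail.

```python
from statistics import mean, pstdev


def pass_rate_creep(pass_rates):
    """Least-squares slope of aggregate pass-rate per run, in run order."""
    n = len(pass_rates)
    if n < 2:
        return 0.0
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(pass_rates)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, pass_rates))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def difficulty_spread(per_case_pass_rates):
    """Population standard deviation of per-case pass-rates.

    A value near zero means the cases are nearly all equally easy or
    equally hard, so the cohort discriminates poorly.
    """
    return pstdev(per_case_pass_rates)


def remediation_rate(failed_case_ids, ticketed_case_ids):
    """Share of failed cases linked to a remediation ticket (None if no failures)."""
    failed = set(failed_case_ids)
    if not failed:
        return None
    return len(failed & set(ticketed_case_ids)) / len(failed)
```

A positive slope is only a leakage signal when it is not matched by known agent improvements, so the number needs to be read alongside the release history rather than on its own.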
Patterns from financial services, digital health and public sector
In financial services, cohort authors are typically second-line risk and compliance, with cases drawn from suitability-review files and customer complaints. In digital health, cohort authors are clinical safety officers, with cases drawn from incident reports and adverse-event records. In public sector, cohort authors are policy teams, with cases drawn from case-handling appeals and ombudsman referrals.
The roles differ; the independence property is the same. Once a firm has experienced a certification gate that the engineering team cannot trivially overfit to, it is hard to go back.
Takeaway
If the engineering team authors the certification cases, the certification is circular. Independence is the property; rotation and access logging are how you keep it.