The Platform

Four modules. One integrated governance surface.

Evaluation, telemetry, audit trail and governance — designed against UK regulator expectations and the technical reality of production agent estates.

Agent Behaviour Evaluation | Production Telemetry & Tracing | Compliance Audit Trail | Cross-Team Governance Dashboard

Agent Behaviour Evaluation

Methodology-authored harnesses that test reasoning, tool use, safety and regression — not vanity benchmarks.

Evaluation runs against deterministic harness suites authored by senior practitioners with first-hand experience of agent failure modes in production. Each harness asserts measurable behaviour: tool selection correctness, refusal coverage, hallucination boundaries, structured-output schema conformance and regression against a versioned ground truth set.

Harnesses are versioned alongside agent code, so you can demonstrate to a regulator that the agent shipped to production today is the same agent that passed evaluation in CI. Custom harnesses are first-class: build them against your sector's risk taxonomy without leaving the platform.

Where peer tools treat evaluation as offline batch scoring, AgentAudit treats it as a live regression discipline. Evaluation runs trigger on every agent code change and on every prompt or tool revision, and they gate deployment when a severity threshold is breached.
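
As an illustration of severity gating, the sketch below blocks a deployment when failed harnesses exceed a per-severity allowance. It is a minimal sketch under stated assumptions, not AgentAudit's API: the `Severity` levels, the `HarnessResult` fields, the harness names and the `gate_deployment` function are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; the real platform's levels may differ.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class HarnessResult:
    harness: str        # hypothetical harness name, e.g. "tool_selection"
    passed: bool
    severity: Severity  # severity assigned to a failure of this harness


def gate_deployment(results, allowed_failures):
    """Return (allowed, reasons).

    Deployment is blocked when the number of failures at any severity
    exceeds the allowance configured for that severity (default 0).
    """
    counts = {}
    for result in results:
        if not result.passed:
            counts[result.severity] = counts.get(result.severity, 0) + 1

    reasons = []
    for severity in sorted(counts, reverse=True):
        limit = allowed_failures.get(severity, 0)
        if counts[severity] > limit:
            reasons.append(
                f"{severity.name}: {counts[severity]} failure(s), allowed {limit}"
            )
    return (not reasons, reasons)


results = [
    HarnessResult("tool_selection", True, Severity.HIGH),
    HarnessResult("refusal_coverage", False, Severity.CRITICAL),
    HarnessResult("schema_conformance", False, Severity.LOW),
]
# Assumed per-agent configuration: tolerate up to two LOW failures.
allowed, reasons = gate_deployment(results, {Severity.LOW: 2})
```

A CI step could run such a check after the harness suite and exit non-zero when `allowed` is false, which is one way a pipeline can be made to gate a release.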

Architecture

Evidence flows from agent runtime to regulator-readable export — no manual handoff.

Integrations

GitHub Actions, GitLab CI, Bitbucket Pipelines, OpenAI, Anthropic, Bedrock

How are harnesses authored?

By AgentAudit applied researchers, with a custom-harness builder available to customers.

Do harnesses run on every commit?

Yes — full suite on main, smoke suite on PRs, with severity gating configured per agent.

Can we bring our own evaluation data?

Yes. Customer-owned ground truth and harnesses are first-class and isolated to your tenant.

Production Telemetry & Tracing

Live tracing, behavioural metrics and anomaly detection across every agent call in production.

Every agent invocation is captured as a structured trace: prompt, tool calls, model responses, latency budget, token cost and downstream effect. Traces are queryable across the estate with sub-second filter response times, including across business units and sector overlays.

Behavioural metrics surface drift before users notice it: refusal-rate excursions, tool-selection distribution shift, latency tail growth and structured-output schema violations all raise alerts against configurable thresholds.
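
To make the threshold idea concrete, here is a minimal sketch that computes a refusal rate over a window of traces and raises an alert when it departs from a baseline by more than a configured amount. The trace shape (a dict with a `refused` flag), the `Threshold` fields and the numeric values are assumptions for illustration, not the platform's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str           # hypothetical metric name
    baseline: float       # expected value for the metric
    max_deviation: float  # absolute excursion tolerated before alerting


def refusal_rate(traces):
    """Fraction of traces in the window in which the agent refused."""
    if not traces:
        return 0.0
    return sum(1 for trace in traces if trace["refused"]) / len(traces)


def check_thresholds(observed, thresholds):
    """Return one alert dict per metric whose excursion exceeds its threshold."""
    alerts = []
    for threshold in thresholds:
        value = observed.get(threshold.metric)
        if value is None:
            continue  # metric not reported in this window
        if abs(value - threshold.baseline) > threshold.max_deviation:
            alerts.append(
                {
                    "metric": threshold.metric,
                    "value": value,
                    "baseline": threshold.baseline,
                }
            )
    return alerts


# Illustrative window: 3 refusals out of 10 calls.
window = [{"refused": i < 3} for i in range(10)]
observed = {"refusal_rate": refusal_rate(window)}
alerts = check_thresholds(
    observed, [Threshold("refusal_rate", baseline=0.05, max_deviation=0.10)]
)
```

The same pattern extends to the other signals named above, such as schema-violation counts or a latency percentile, by adding entries to `observed` and a matching `Threshold`.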

Tracing is opt-in at the PII level: customer data classifications govern what is recorded and what is redacted at capture time, with a documented evidence trail for the ICO.
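
One way to picture capture-time redaction is the sketch below: each field is looked up in a data classification, fields whose class is not approved for recording are replaced before storage, and every redaction is noted as evidence. The classification labels, the field names and the fail-closed default for unclassified fields are assumptions made for this example.

```python
REDACTED = "[REDACTED]"


def redact_at_capture(record, classification, recordable):
    """Return (captured, evidence).

    `classification` maps field name -> class label; `recordable` is the set
    of labels approved for recording. Unclassified fields are treated as
    "restricted" (an assumed fail-closed default) and therefore redacted.
    """
    captured = {}
    evidence = []
    for field, value in record.items():
        label = classification.get(field, "restricted")
        if label in recordable:
            captured[field] = value
        else:
            captured[field] = REDACTED
            # Evidence records that a redaction happened, never the value.
            evidence.append({"field": field, "classification": label})
    return captured, evidence


record = {"prompt": "Where is my order?", "email": "a@example.com", "notes": "n/a"}
classification = {"prompt": "internal", "email": "pii"}
captured, evidence = redact_at_capture(record, classification, {"internal"})
```

Redacting before the trace is written, rather than masking at query time, means the sensitive value never reaches storage.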

Integrations

OpenTelemetry, Datadog, Splunk, Grafana, PagerDuty, Slack

What is the trace retention default?

12 months hot, 7 years cold for regulated tiers. Configurable per agent.

How is PII handled?

Capture-time redaction governed by your data classification, with audit evidence of every redaction.

Sampling?

Full capture by default for regulated workloads. Sampling configurable for high-volume non-sensitive agents.

Compliance Audit Trail

Regulator-readable decision rationale, immutable logs and on-demand sub-period reporting.

Every consequential decision an agent makes is recorded with its rationale — the prompts, tools, retrieved context, model output and the policy applied — and stored in an immutable append-only log with cryptographic integrity proofs.
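
The FAQ for this module mentions per-batch Merkle proofs. As a rough sketch of how such a proof lets a third party check that one log entry belongs to a sealed batch, the code below builds a SHA-256 Merkle root, derives an inclusion proof and verifies it. The leaf and node prefixes, the padding rule (duplicating the last node on odd levels) and all function names are assumptions for illustration; the platform's actual construction is not described in the source.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _leaf_hashes(entries):
    # Prefix leaves with 0x00 and interior nodes with 0x01 so that a leaf
    # hash can never be confused with an interior node hash.
    return [_h(b"\x00" + entry) for entry in entries]


def _next_level(level):
    return [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(entries):
    if not entries:
        raise ValueError("batch must contain at least one entry")
    level = _leaf_hashes(entries)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # simplified padding for odd levels
        level = _next_level(level)
    return level[0]


def merkle_proof(entries, index):
    """Return the sibling path for entries[index] as (hash, sibling_is_left)."""
    if not 0 <= index < len(entries):
        raise IndexError("index out of range")
    level = _leaf_hashes(entries)
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify_inclusion(entry, proof, root):
    node = _h(b"\x00" + entry)
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = _h(b"\x01" + pair)
    return node == root


batch = [f"decision-{i}".encode() for i in range(5)]
root = merkle_root(batch)
proof = merkle_proof(batch, 4)
```

A verifier needs only the entry, the proof and the published root, so the check can run without access to the rest of the batch.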

Audit reports can be generated for arbitrary sub-periods: a single section review window, a board reporting cycle or the scope of a regulator's audit request. Reports export as PDF, CSV and JSON, with a regulator-readable formatting profile.

Decision rationale lookup lets a compliance reviewer answer the question 'why did this agent do that?' in seconds — without engineering involvement and without trawling logs.

Integrations

Confluence, SharePoint, Box, DocuSign, Microsoft 365, Google Workspace

How is integrity guaranteed?

Append-only with per-batch Merkle proofs; integrity is verifiable independently of AgentAudit.

Can we export to our GRC?

Yes — Archer, ServiceNow GRC, OneTrust and custom REST targets are supported.

Sub-period granularity?

Down to the hour. Section reviewers configure their own reporting windows.

Cross-Team Governance Dashboard

Estate-wide risk classification, team and business unit roll-up, board-level reporting.

A single surface for the operator-admin, governance lead, compliance reviewer and board reporter — without context switching. Estate-wide views roll up by team, by business unit, by sector overlay and by risk class.

Each agent carries a risk classification grounded in the UK AI Action Plan taxonomy, with an override workflow and named-approver evidence. Risk class automatically drives evaluation cadence, telemetry retention and audit reporting depth.
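
A simple way to think about a risk class driving downstream settings is a lookup from class to policy, with an approved override taking precedence. The sketch below assumes three class labels and invented cadence, retention and depth values; none of these are the platform's real defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    eval_cadence: str          # how often evaluation runs
    hot_retention_months: int  # telemetry kept in hot storage
    report_depth: str          # level of detail in audit reports


# Hypothetical mapping; labels and values are illustrative only.
POLICY_BY_RISK_CLASS = {
    "low": Policy("weekly", 3, "summary"),
    "medium": Policy("daily", 6, "standard"),
    "high": Policy("every_change", 12, "full"),
}


def policy_for(agent):
    """Resolve the policy for an agent, preferring an approved override."""
    risk_class = agent.get("risk_override") or agent["risk_class"]
    return POLICY_BY_RISK_CLASS[risk_class]


# Assumed agent record: classified "medium", overridden to "high".
agent = {"name": "claims-triage", "risk_class": "medium", "risk_override": "high"}
policy = policy_for(agent)
```

Keeping the mapping in one table means a change of risk class alters cadence, retention and reporting depth together rather than through three separate settings.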

Board reporting templates produce a quarterly governance pack in one click — built against the disclosure expectations of UK regulated enterprises, not a generic AI dashboard.

Integrations

Okta, Azure AD, Tableau, Power BI, Looker

Who can override risk class?

Named governance leads with reviewer countersign. Every override is evidenced in the audit trail.

Can the board report be customised?

Yes — the template is editable and customers can author bespoke reporting profiles.

Does the dashboard surface costs?

Yes — token spend, infra cost and outcome metrics roll up at agent, team and BU level.

Ready to govern your agent estate?

Book a 30-minute walkthrough with our team. We'll map your agent estate, regulator surface and rollout plan.