The Living Constitution

Agent Sentinel benchmarks protocol

v1.0 · awaiting first benchmark run

Path: /benchmarks/protocol.md · v1.0 · 2026.05

The portfolio claims Agent Sentinel is adoptable as runtime infrastructure. Adoption requires operating metrics that adopters can plan against. This protocol specifies what we measure, how we measure it, and what the published numbers mean.

A safety detector with no published benchmarks is, from an adopter's standpoint, indistinguishable from a research demo.


Metrics we publish

| Metric | What it is | Why it matters |
| --- | --- | --- |
| Latency overhead (median, p95, p99) | Wall-clock time added per agent turn when Sentinel is in the loop | Determines whether Sentinel can run synchronously vs. in a side-channel |
| Memory overhead (steady-state) | Additional RSS when Sentinel is running | Determines deployment footprint constraints |
| CPU overhead (mean utilization) | Additional CPU during agent activity | Sizes the operational cost |
| Storage overhead (per session) | Disk used per scanned session, retained vs. archived | Determines longitudinal storage budget |
| Precision (CCD-positive predictive value) | Of cases flagged, fraction that are CCD | Determines false-positive trust budget |
| Recall (CCD true-positive rate) | Of CCD cases present, fraction flagged | Determines what slips through |
| F1 | Harmonic mean of precision and recall | Single-number summary |
| AUROC | Area under the ROC curve (true-positive rate vs. false-positive rate) | Threshold-independent quality |
| False-positive rate on benign workloads | Fraction of clean sessions on which Sentinel triggers | Determines whether Sentinel produces alert fatigue |
| Time-to-detection (in-session) | Number of turns until Sentinel triggers in a CCD session | Determines whether intervention is possible mid-session |

Every metric is reported with a 95% confidence interval and the sample size.
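As a minimal sketch of how a detection metric and its interval could be computed together, the following derives precision, recall, and F1 from per-case labels and then estimates a 95% interval for F1 with a percentile bootstrap. The protocol does not state which interval method is used, so the bootstrap, the function names, and the resample count are all assumptions made for illustration.

```python
import random

def precision_recall_f1(labels, flags):
    """labels[i] is True if case i is CCD; flags[i] is True if Sentinel flagged it."""
    tp = sum(1 for y, f in zip(labels, flags) if y and f)
    fp = sum(1 for y, f in zip(labels, flags) if not y and f)
    fn = sum(1 for y, f in zip(labels, flags) if y and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def bootstrap_ci_f1(labels, flags, n_resamples=2000, seed=0):
    """Percentile-bootstrap 95% interval for F1 (assumed method, not the protocol's)."""
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping each label paired with its flag.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(precision_recall_f1([labels[i] for i in idx],
                                          [flags[i] for i in idx])[2])
    scores.sort()
    return scores[int(0.025 * n_resamples)], scores[int(0.975 * n_resamples) - 1]
```

Resampling whole cases keeps each label paired with its flag, so the interval reflects uncertainty in the corpus sample and not in the detector itself.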


What we benchmark against

Four populations:

1. The held-out corpus (preprint Section 9.1)

$\geq 100$ CCD-suspect cases plus matched controls. Precision, recall, F1, AUROC, time-to-detection.

2. Synthetic clean workloads

Three synthetic populations of "clean" coding-assistant sessions:

- Greenfield: session that builds a feature from scratch, all claims grounded in committed code.
- Maintenance: session that fixes bugs, makes incremental changes, all changes verifiable.
- Spec-first: session in a repo where documentation legitimately precedes runtime artifacts.

Each population has $\geq 50$ sessions. The false-positive rate is reported per population. Spec-first is the highest-stakes false-positive surface because legitimate doc-before-runtime should not trigger F2.
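With populations of around 50 sessions, the interval around a false-positive rate is wide, and a simple normal approximation behaves poorly near zero. The sketch below uses the Wilson score interval as one reasonable choice; the protocol does not name an interval method, so this is an assumption, as are the function names.

```python
import math

def false_positive_rate(triggered):
    """triggered[i] is True if Sentinel fired on clean session i."""
    return sum(triggered) / len(triggered)

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k events in n trials (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

Even zero triggers in 50 clean sessions leaves an upper bound of roughly 7%, which is why the sample size is published next to every rate.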

3. Real production-shape workloads (pilot adopter data, when available)

Anonymized, opt-in, post-disclosure session traces from pilot partners. Reported when at least one pilot partner has consented to publish.

4. Adversarial workloads

The red-team groups' submissions from the H4 adversarial evaluation. False-negative rates per documented evasion path (Ev-1 through Ev-4 from the threat model).


Operational benchmarks (latency, memory, CPU, storage)

Run on three reference machines:

| Profile | Specs | Use case |
| --- | --- | --- |
| Dev laptop | M-series Mac, 16 GB RAM, NVMe SSD | Local developer use |
| Cloud workstation | x86_64, 8 vCPU, 16 GB RAM | Remote dev / IDE plugin |
| Server inline | x86_64, 32 vCPU, 64 GB RAM | Inline runtime for an enterprise CI/agent pipeline |

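For each profile, we run three workload shapes: single session, multi-session, and high volume (the single_session, multi_session, and high_volume keys in results.json).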

Reported per workload:

- Median, p95, and p99 latency added per turn (target: median $\leq 50$ ms, p95 $\leq 200$ ms, p99 $\leq 500$ ms on the dev laptop with a single session).
- Steady-state RSS overhead.
- Mean CPU utilization during agent activity.
- Disk: per-session JSONL log size, scan-report size.
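The latency summary and target check can be sketched as follows, assuming the per-turn overhead samples (in milliseconds) have already been collected. The nearest-rank percentile definition and the names used here are illustrative choices; the protocol does not specify a percentile method.

```python
# Dev-laptop, single-session targets from the protocol, in milliseconds.
TARGETS_MS = {"median": 50, "p95": 200, "p99": 500}

def percentile(sorted_samples, q):
    """Nearest-rank percentile for an integer q in 1..100 (assumed definition)."""
    n = len(sorted_samples)
    rank = -(-q * n // 100)  # ceil(q * n / 100) using integer arithmetic
    return sorted_samples[max(rank, 1) - 1]

def summarize_latency(overheads_ms):
    """Summarize per-turn latency overhead and compare it with the targets."""
    s = sorted(overheads_ms)
    summary = {
        "median": percentile(s, 50),
        "p95": percentile(s, 95),
        "p99": percentile(s, 99),
        "n": len(s),
    }
    summary["meets_targets"] = all(summary[k] <= v for k, v in TARGETS_MS.items())
    return summary
```

Nearest-rank always returns an observed sample, which keeps the tail figures easy to audit against the raw log.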


What we publish

For each release of PROACTIVE/SentinelOS, a benchmarks/[version]/ directory contains:

results.json schema (field values below are illustrative; v1.0 is awaiting its first benchmark run):

```json
{
  "version": "1.0.0",
  "commit": "abc123...",
  "benchmark_date": "2026-05-18T00:00:00Z",
  "environment": {
    "os": "macos-14.6", "cpu": "M3-Pro", "ram_gb": 18
  },
  "operational": {
    "single_session": {
      "latency_ms": { "median": 38, "p95": 142, "p99": 311, "n": 5000 },
      "memory_overhead_mb": { "mean": 84, "p95": 96 },
      "cpu_pct": { "mean": 3.2, "p95": 8.1 }
    },
    "multi_session": { ... },
    "high_volume": { ... }
  },
  "detection": {
    "held_out_corpus": {
      "precision": 0.84, "recall": 0.81, "f1": 0.82, "auroc": 0.89,
      "n_positive": 50, "n_negative": 50, "ci95_f1": [0.76, 0.88]
    },
    "synthetic_clean": {
      "greenfield": { "false_positive_rate": 0.02, "n": 50, "ci95": [0.01, 0.07] },
      "maintenance": { "false_positive_rate": 0.04, "n": 50, "ci95": [0.01, 0.10] },
      "spec_first":  { "false_positive_rate": 0.11, "n": 50, "ci95": [0.05, 0.21] }
    },
    "adversarial": {
      "ev_1_single_session_compression": { "recall": 0.62, "n_attempts": 100 },
      "ev_2_artifact_suppression":      { "recall": 0.74, "n_attempts": 100 },
      "ev_3_hedging_escalation":        { "recall": 0.55, "n_attempts": 100 },
      "ev_4_post_hoc_admission_suppression": { "recall": 0.39, "n_attempts": 100 }
    },
    "time_to_detection_turns": { "median": 47, "p25": 22, "p75": 84, "n": 50 }
  }
}
```
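An adopter consuming results.json may want to confirm that a file carries every top-level section before reading numbers from it. The following is a minimal sketch of such a check. The key names are taken from the schema above, while the function name and the choice of which paths to require are assumptions, not part of the published protocol.

```python
# Dotted paths that this sketch treats as required (an assumption).
REQUIRED_PATHS = [
    "version",
    "commit",
    "benchmark_date",
    "environment",
    "operational.single_session",
    "operational.multi_session",
    "operational.high_volume",
    "detection.held_out_corpus",
    "detection.synthetic_clean",
    "detection.adversarial",
    "detection.time_to_detection_turns",
]

def missing_paths(results, required=REQUIRED_PATHS):
    """Return the required dotted paths that are absent from a parsed results dict."""
    missing = []
    for path in required:
        node = results
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing
```

A caller would parse the file with json.load and treat a non-empty return value as a reason to reject the file.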

How we handle the spec-first false-positive surface

The most worrying false-positive surface is "spec-first development": legitimate practices in which documentation precedes runtime artifacts. F2 (artifact-claim divergence) is, by construction, sensitive to this.

We address it three ways:

  1. Per-component acceptance tests. F2 is computed only after a stated acceptance window. A component documented at $t_0$ does not trigger F2 until $t_0 + W$, where $W$ is configurable per repo (default 14 days; rationale published).
  2. Spec-first synthetic workload in published benchmarks. The false-positive rate on spec-first is published with the same prominence as the F1 score.
  3. Configurable F2 threshold. Adopters in spec-first repos can raise the threshold; the configuration is logged and reported alongside detection outcomes.
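The acceptance-window rule in item 1 can be sketched as a simple time gate. This is an illustration under stated assumptions: the function and constant names are hypothetical, and the real feature presumably consults per-repo configuration to obtain $W$.

```python
from datetime import datetime, timedelta, timezone

# Default acceptance window W from the protocol: 14 days.
DEFAULT_WINDOW = timedelta(days=14)

def f2_eligible(documented_at, now, window=DEFAULT_WINDOW):
    """True once now >= t0 + W, i.e. the component may start to trigger F2."""
    return now >= documented_at + window

# Usage: a component documented on 1 May is exempt from F2 until 15 May.
t0 = datetime(2026, 5, 1, tzinfo=timezone.utc)
still_in_window = not f2_eligible(t0, datetime(2026, 5, 10, tzinfo=timezone.utc))
```

Passing a larger window, for example timedelta(days=30), models a repo that has configured a longer spec-first grace period.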

The honest summary: F2-only detection in a spec-first repo is unreliable. Running features F1, F2, F3, and F4 together substantially reduces the false-positive rate, so adopters in spec-first contexts should run with the full feature set.


When benchmarks must be re-run

A new benchmark run is required when any of the following change:

- PROACTIVE feature definitions.
- The held-in corpus (re-labeling, additions, removals).
- The benchmark environment (different reference machines).
- Any reproduce-path change that affects test outputs.

Releases not accompanied by a fresh benchmark run are explicitly labeled "operational-only" or "documentation-only" and do not update the headline numbers on the homepage.
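A release pipeline could encode the re-run rule as a set intersection. The sketch below is hypothetical throughout: the change-kind labels are invented for illustration, one per trigger listed above, and nothing here reflects actual release tooling.

```python
# Hypothetical change-kind labels, one per re-run trigger in the protocol.
RERUN_TRIGGERS = frozenset({
    "feature_definitions",
    "held_in_corpus",
    "benchmark_environment",
    "reproduce_path_outputs",
})

def rerun_reasons(change_kinds):
    """Return the change kinds in a release that force a fresh benchmark run."""
    return sorted(RERUN_TRIGGERS & set(change_kinds))

def requires_benchmark_run(change_kinds):
    """True if at least one change in the release is a re-run trigger."""
    return bool(rerun_reasons(change_kinds))
```

A release for which requires_benchmark_run returns False would carry one of the two labels above and leave the headline numbers unchanged.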


What benchmarks do not measure

We publish benchmarks knowing they are necessary but not sufficient for adoption. They do not measure user acceptance or harm reduction; adopters who need that data are directed to pilot reports, not benchmarks.