
Agent Sentinel benchmarks protocol

v1.0 · awaiting first benchmark run

Path: /benchmarks/protocol.md · v1.0 · 2026.05

The portfolio claims Agent Sentinel is adoptable as runtime infrastructure. Adoption requires operating metrics that adopters can plan against. This protocol specifies what we measure, how we measure it, and what the published numbers mean.

A safety detector with no published benchmarks is, from an adopter's standpoint, indistinguishable from a research demo.

Metrics we publish

Metric	What it is	Why it matters
Latency overhead (median, p95, p99)	Wall-clock time added per agent turn when Sentinel is in the loop	Determines whether Sentinel can run synchronously or must run in a side channel
Memory overhead (steady-state)	Additional RSS when Sentinel is running	Determines deployment footprint constraints
CPU overhead (mean utilization)	Additional CPU during agent activity	Sizes the operational cost
Storage overhead (per session)	Disk used per scanned session, retained vs. archived	Determines longitudinal storage budget
Precision (CCD-positive predictive value)	Of cases flagged, fraction that are CCD	Determines false-positive trust budget
Recall (CCD true-positive rate)	Of CCD cases present, fraction flagged	Determines what slips through
F1	Harmonic mean of precision and recall	Single-number summary
AUROC	Area under the ROC curve (true-positive rate vs. false-positive rate)	Threshold-independent quality
False-positive rate on benign workloads	Fraction of clean sessions on which Sentinel triggers	Determines whether Sentinel produces alert fatigue
Time-to-detection (in-session)	Number of turns until Sentinel triggers in a CCD session	Determines whether intervention is possible mid-session

Every metric is reported with a 95% confidence interval and the sample size.
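As a rough illustration of how the point metrics and one kind of 95% interval could be computed, the sketch below derives precision, recall, and F1 from confusion counts and a Wilson score interval for a proportion such as a false-positive rate. The `Confusion` class, the function names, and the choice of the Wilson interval are assumptions made for this example; the protocol does not state which interval method is used, and intervals for F1 or AUROC would need a different approach (for example, bootstrapping).

```python
import math
from dataclasses import dataclass


@dataclass
class Confusion:
    """Hypothetical container: Sentinel flags compared against CCD labels."""
    tp: int  # CCD sessions that were flagged
    fp: int  # non-CCD sessions that were flagged
    fn: int  # CCD sessions that were missed
    tn: int  # non-CCD sessions that were not flagged


def precision(c: Confusion) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    present = c.tp + c.fn
    return c.tp / present if present else 0.0


def f1(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

For example, `wilson_ci(1, 50)` gives the interval for one false positive in 50 clean sessions; the sample size `n` is reported next to the interval, as the protocol requires.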

What we benchmark against

Four populations:

1. The held-out corpus (preprint Section 9.1)

$\geq 100$ CCD-suspect cases plus matched controls. Metrics reported: precision, recall, F1, AUROC, and time-to-detection.
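AUROC can be read as the probability that a randomly chosen CCD case receives a higher detector score than a randomly chosen control. The sketch below computes it directly from that definition. It assumes Sentinel emits a continuous per-session score, which this protocol does not specify; the function name and inputs are hypothetical.

```python
def auroc(scores_pos: list, scores_neg: list) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a corpus on the order of a hundred sessions; a rank-based formulation would be preferable for much larger sets.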

2. Synthetic clean workloads

Three synthetic populations of "clean" coding-assistant sessions:

- Greenfield: session that builds a feature from scratch, all claims grounded in committed code.
- Maintenance: session that fixes bugs, makes incremental changes, all changes verifiable.
- Spec-first: session in a repo where documentation legitimately precedes runtime artifacts.

Each population contains $\geq 50$ sessions. The false-positive rate is reported per population. Spec-first is the highest-stakes false-positive surface because legitimate doc-before-runtime should not trigger F2.

3. Real production-shape workloads (pilot adopter data, when available)

Anonymized, opt-in, post-disclosure session traces from pilot partners. Reported when at least one pilot partner has consented to publish.

4. Adversarial workloads

The red-team groups' submissions from the H4 adversarial evaluation. False-negative rates per documented evasion path (Ev-1 through Ev-4 from the threat model).

Operational benchmarks (latency, memory, CPU, storage)

Run on three reference machines:

Profile	Specs	Use case
Dev laptop	M-series Mac, 16 GB RAM, NVMe SSD	Local developer use
Cloud workstation	x86_64, 8 vCPU, 16 GB RAM	Remote dev / IDE plugin
Server inline	x86_64, 32 vCPU, 64 GB RAM	Inline runtime for an enterprise CI/agent pipeline

For each profile, three workload shapes:

Single-session monitor: one active session, transcript appended turn-by-turn.
Multi-session monitor: 5 concurrent sessions.
High-volume server: 100 concurrent sessions (server profile only).

Reported per workload:

- Median, p95, and p99 latency added per turn (target: median $\leq 50$ ms, p95 $\leq 200$ ms, p99 $\leq 500$ ms on dev laptop with single session).
- Steady-state RSS overhead.
- Mean CPU utilization during agent activity.
- Disk: per-session JSONL log size, scan-report size.
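A minimal sketch of how the latency figures might be summarized and compared with the stated targets follows. It assumes the input is a list of per-turn overhead samples in milliseconds (latency with Sentinel minus latency without it); the function names and the use of the "inclusive" quantile method are choices made for this example, not part of the protocol.

```python
import statistics

# Targets stated in the protocol: dev laptop, single session, milliseconds.
TARGETS_MS = {"median": 50.0, "p95": 200.0, "p99": 500.0}


def latency_summary(samples_ms: list) -> dict:
    """Median, p95, and p99 of per-turn added latency, plus the sample size."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "median": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "n": len(samples_ms),
    }


def meets_targets(summary: dict, targets: dict = TARGETS_MS) -> dict:
    """Per-statistic pass/fail against the latency targets."""
    return {name: summary[name] <= limit for name, limit in targets.items()}
```

Reporting the per-statistic result rather than a single boolean keeps a p99 regression visible even when the median is comfortably within target.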

What we publish

For each release of PROACTIVE/SentinelOS, a benchmarks/[version]/ directory contains:

methodology.md — re-statement of this protocol with any version-specific adjustments noted.
results.json — machine-readable results in the format below.
results.md — human-readable summary.
raw/ — full per-run logs.
notebooks/ — analysis notebooks for plot reproduction.

results.json schema (the values shown are illustrative placeholders, not measured results):

{
  "version": "1.0.0",
  "commit": "abc123...",
  "benchmark_date": "2026-05-18T00:00:00Z",
  "environment": {
    "os": "macos-14.6", "cpu": "M3-Pro", "ram_gb": 18
  },
  "operational": {
    "single_session": {
      "latency_ms": { "median": 38, "p95": 142, "p99": 311, "n": 5000 },
      "memory_overhead_mb": { "mean": 84, "p95": 96 },
      "cpu_pct": { "mean": 3.2, "p95": 8.1 }
    },
    "multi_session": { ... },
    "high_volume": { ... }
  },
  "detection": {
    "held_out_corpus": {
      "precision": 0.84, "recall": 0.81, "f1": 0.82, "auroc": 0.89,
      "n_positive": 50, "n_negative": 50, "ci95_f1": [0.76, 0.88]
    },
    "synthetic_clean": {
      "greenfield": { "false_positive_rate": 0.02, "n": 50, "ci95": [0.01, 0.07] },
      "maintenance": { "false_positive_rate": 0.04, "n": 50, "ci95": [0.01, 0.10] },
      "spec_first":  { "false_positive_rate": 0.11, "n": 50, "ci95": [0.05, 0.21] }
    },
    "adversarial": {
      "ev_1_single_session_compression": { "recall": 0.62, "n_attempts": 100 },
      "ev_2_artifact_suppression":      { "recall": 0.74, "n_attempts": 100 },
      "ev_3_hedging_escalation":        { "recall": 0.55, "n_attempts": 100 },
      "ev_4_post_hoc_admission_suppression": { "recall": 0.39, "n_attempts": 100 }
    },
    "time_to_detection_turns": { "median": 47, "p25": 22, "p75": 84, "n": 50 }
  }
}
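Because results.json is meant to be machine-readable, a consumer could run basic consistency checks before trusting a published file. The sketch below checks one `held_out_corpus` entry using the field names from the schema above; the function name, the rounding tolerance, and the specific checks are assumptions for illustration, not a published validator.

```python
import json


def check_held_out(entry: dict, tol: float = 0.01) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for name in ("precision", "recall", "f1", "auroc"):
        if not 0.0 <= entry[name] <= 1.0:
            problems.append(f"{name} outside [0, 1]")
    p, r = entry["precision"], entry["recall"]
    # F1 is the harmonic mean of precision and recall; allow for rounding.
    expected_f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    if abs(entry["f1"] - expected_f1) > tol:
        problems.append("f1 inconsistent with precision and recall")
    lo, hi = entry["ci95_f1"]
    if not lo <= entry["f1"] <= hi:
        problems.append("f1 outside ci95_f1")
    return problems


# Fragment copied from the schema example above.
example = json.loads("""
{
  "precision": 0.84, "recall": 0.81, "f1": 0.82, "auroc": 0.89,
  "n_positive": 50, "n_negative": 50, "ci95_f1": [0.76, 0.88]
}
""")
print(check_held_out(example))
```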

How we handle the spec-first false-positive surface

The most worrying false-positive surface is "spec-first development": legitimate practices in which documentation precedes runtime artifacts. F2 (artifact-claim divergence) is, by construction, sensitive to this.

We address it three ways:

Per-component acceptance tests. F2 is computed only after a stated acceptance window. A component documented at $t_0$ does not trigger F2 until $t_0 + W$, where $W$ is configurable per repo (default 14 days; rationale published).
Spec-first synthetic workload in published benchmarks. The false-positive rate on spec-first is published with the same prominence as the F1 score.
Configurable F2 threshold. Adopters in spec-first repos can raise the threshold; the configuration is logged and reported alongside detection outcomes.
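The acceptance-window rule in the first item can be sketched as a simple time comparison: a component documented at $t_0$ becomes eligible for F2 only once $t_0 + W$ has been reached. The function name and the treatment of the exact boundary (eligible at $t_0 + W$ itself) are assumptions for this example; only the 14-day default comes from the text above.

```python
from datetime import datetime, timedelta, timezone

# Default acceptance window W from the protocol; configurable per repo.
DEFAULT_WINDOW = timedelta(days=14)


def f2_eligible(documented_at: datetime, now: datetime,
                window: timedelta = DEFAULT_WINDOW) -> bool:
    """True once the acceptance window has elapsed since documentation time t0."""
    return now >= documented_at + window


t0 = datetime(2026, 5, 1, tzinfo=timezone.utc)
print(f2_eligible(t0, t0 + timedelta(days=13)))  # still inside the window
print(f2_eligible(t0, t0 + timedelta(days=14)))  # window has elapsed
```

Using timezone-aware timestamps avoids ambiguity when the documentation commit and the scan run in different time zones.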

The honest summary: detection that relies on F2 alone is unreliable in a spec-first repo. The four features together (F1, F2, F3, and F4, where F1 is the detector feature, not the F1 score) substantially reduce the false-positive rate. Adopters in spec-first contexts should run with the full feature set.

When benchmarks must be re-run

A new benchmark run is required when any of the following change:

- PROACTIVE feature definitions.
- The held-in corpus (re-labeling, additions, removals).
- The benchmark environment (different reference machines).
- Any reproduce-path change that affects test outputs.

Releases not accompanied by a fresh benchmark run are explicitly labeled "operational-only" or "documentation-only" and do not update the headline numbers on the homepage.

What benchmarks do not measure

Real-world harm reduction. (Pilot evaluations, not benchmarks.)
User acceptance and trust. (Pilot evaluations.)
Long-horizon detector drift. (Longitudinal study; not v1.)
Detector behavior under data poisoning. (Threat model item S-1; benchmark candidate for v1.1.)

We publish benchmarks knowing they are necessary but not sufficient for adoption. Adopters who need the user-acceptance and harm-reduction data are directed to pilot reports, not benchmarks.