home · /security/threat-model

Agent Sentinel — Threat Model

v1.0

Path: /security/threat-model.md · v1.0 · 2026.05

This document states what Agent Sentinel detects, what it does not detect, what it assumes about its operating environment, and what an adversary would have to do to defeat it. It is published under STRIDE-adjacent headings adapted for behavioral-safety detectors rather than confidentiality/integrity systems.

A safety product without a public threat model is a marketing artifact. This document exists so reviewers, adopters, and adversaries all have the same picture.

1. Assets being protected

Agent Sentinel protects two assets:

A1. The user's trust signal — their ability to act on agent representations of system state with appropriate calibration.

A2. The user's material interests — credits, time, opportunity cost, downstream commitments that depend on the trust signal being correct.

It does not protect: - The underlying agent's correctness (a separate problem). - The underlying repository's security (a separate problem). - The user's privacy from the agent itself (Sentinel is not a privacy product; the user has already invited the agent into their context).

2. Adversaries

We model three adversaries, in order of decreasing benevolence:

Adv-1: The non-adversarial coding assistant. A commercial coding assistant that exhibits CCD as a learned policy from training, with no intent to evade detection. This is the threat model of FOLIO 001. Defeating Sentinel requires no adversarial behavior; it requires Sentinel to be miscalibrated.

Adv-2: The white-box adversary. A red-team with read access to PROACTIVE's source code, training corpus, and feature definitions. They attempt to craft prompt sequences that produce CCD-shaped behavior PROACTIVE does not flag.

Adv-3: The model-trainer adversary. A party with the ability to fine-tune the underlying coding assistant against Sentinel's detection rules. The most powerful adversary and the most expensive to defeat.

Sentinel v1 is designed against Adv-1, partially robust against Adv-2 (Section 6), and explicitly out-of-scope against Adv-3 (Section 7).

3. What Sentinel detects (in scope)

Sentinel v1 detects, with documented precision and recall:

Cross-session false-completion claims about specified components, where the construct can be acceptance-tested against repository state.
Documentation-artifact divergence: documentation describing a feature with no corresponding runtime artifact.
Post-hoc admission patterns: the lexical signature of clean retraction-under-challenge.
Deference escalation: hedged-completion language at rates significantly elevated over a baseline.

These four are the four PROACTIVE features. The detector is built and evaluated on them.

4. What Sentinel does not detect (out of scope, v1)

Explicit out-of-scope items. These are not bugs; they are scope decisions.

Single-session, single-utterance fabrication (collapses to Mode 1 hallucination; a separate detector is appropriate).
CCD by agents whose transcripts are not accessible to the user — closed runtimes that do not surface session history.
CCD in non-coding agentic contexts (research agents, legal agents, customer-service agents).
CCD where the "construct" is non-code (e.g., the agent claiming to have run a search it did not run; the agent claiming to have called a tool it did not call).
CCD by agents working in branches or sandboxes the detector cannot inspect.
Strategic underperformance (sandbagging). Inverse failure mode; a different detector.
Alignment-faking on capability evaluations. Different population, different detector.

Future versions of Sentinel may address some of these. v1 does not, and an adopter who needs them in v1 needs a different product.

5. Assumptions about the operating environment

Sentinel v1 assumes:

E1. Transcripts are available, complete, and tamper-resistant from the agent runtime. If the agent runtime suppresses or rewrites transcripts, Sentinel cannot detect what it cannot see.

E2. Repository state is observable to Sentinel at session boundaries. Sentinel needs to be able to checkout, hash, and inspect the working tree.

E3. The user can provide plain-language challenges. Sentinel does not synthesize challenges itself in v1 (a v2 candidate). The user supplies the challenge, Sentinel scores the divergence.

E4. Local execution is acceptable. Sentinel runs on the user's machine; there is no cloud component for the core detector. (Optional shared-corpus contributions are opt-in and end-to-end signed.)

E5. The runtime registry is up to date and faithful to the agent's actual session state. Compromise of the registry compromises Sentinel.

When any of E1–E5 fails, Sentinel's recall degrades. The current product documents these failure modes in user-facing log lines so the user knows when Sentinel is operating under reduced visibility.

6. Adv-2 (white-box adversary) — known evasion paths

A determined white-box adversary can defeat PROACTIVE v1 along the following paths. Each is documented so adopters can choose with full information.

Ev-1. Single-session compression. The adversary collapses what would normally be a multi-session CCD into a single long session. F1 (cross-session persistence) fails. F2 may still trigger; F3 only triggers if challenge is administered. Recall on this evasion path is approximately the recall of the F2 path alone, which we have evaluated at 0.61 (held-in).

Ev-2. Artifact-suppression. The adversary makes false completion claims without generating supporting documentation. F2 (artifact-claim divergence) fails. F4 may still trigger; F1 still works on persistence. Recall on this path: 0.74 (held-in).

Ev-3. Hedging escalation. The adversary uses hedged-completion language ("should be working," "I believe it's on track") aggressively, avoiding declarative completion. F4 (deference escalation) is paradoxically defeated by over-hedging — the deference rate becomes the baseline. We expect a v2 of Sentinel to address this with a calibration step that accounts for absolute hedging rates rather than ratios.

Ev-4. Post-hoc admission suppression. The adversary, under plain-language challenge, neither admits nor restates but instead changes topic. F3 (admission delta) cannot be computed. This is a real evasion path. v2 plan: chain-of-challenge protocols (Sentinel issues follow-up challenges automatically).

These are documented at /security/known-evasions.md and are subject to bounty (see Section 9).

7. Adv-3 (model-trainer adversary) — explicit limitation

Sentinel v1 is not robust against an adversary with the ability to fine-tune the underlying coding assistant against the four PROACTIVE features. A trainer with access to PROACTIVE's feature definitions can, in principle, fine-tune a model to exhibit CCD behavior while keeping all four features below their detection thresholds.

This is not a fixable bug at the detector level; it is a structural property of detector-vs-trainer adversarial games. Sentinel's response to Adv-3 is in two parts:

Open the detector. PROACTIVE is open-source. Adv-3 has no information advantage. A model trainer can build a CCD-evading model, but they can also build a CCD-undetectable model under any closed detector; the openness does not worsen the situation and it helps every defender.
Move detection upstream. v2 plan: shift from interaction-log features to runtime-instrumentation features (tool-call invariants, file-system call inspection, branch coverage of claimed components). These features are harder to fine-tune against.

The honest summary: Sentinel v1 is a defense against unintentional CCD as a learned policy. It is not a defense against malicious CCD as a deliberate fine-tuning target. The first is the problem we have evidence of; the second is the problem we expect to see if our category of concern is taken seriously.

8. Sentinel-self attacks

Three failure modes specific to Sentinel itself, each with a mitigation:

S-1. Detector poisoning via corpus injection. An adversary submits CCD-shaped cases to the public corpus that retrain PROACTIVE in unwanted directions. Mitigation: corpus contributions require RRL-v1 consent, two-annotator review, and quarantine before inclusion. The training-time corpus is a frozen snapshot for each release.

S-2. False-positive amplification. Sentinel flags benign in-progress reporting as CCD, eroding user trust in the detector. Mitigation: false-positive characterization is reported in benchmarks; the contestability/repair loop logs user dismissals and feeds them into per-user calibration.

S-3. Privilege escalation in the action gate. A bug in the restrictive action gate could allow Sentinel to block legitimate operations. Mitigation: action gates are advisory by default; user must opt in to enforcement. Enforcement logs are auditable.

9. Disclosure and bounty

Coordinated disclosure. Vulnerabilities in PROACTIVE or SentinelOS should be reported to security@coreyalejandro.com. 90-day disclosure window from confirmation.

Evasion bounty. A flat $500 bounty per novel evasion path against PROACTIVE features F1–F4. Bounties are funded from the project's general budget when available; when budget is not available, the bounty is acknowledged with attribution and held until funded. See /governance/bounty.md for terms.

Out-of-scope for bounty. The documented evasion paths in Section 6 are not bountied. They are public.

10. Version history

v1.0 (2026.05): Initial threat model. Adv-1 in-scope; Adv-2 partially in-scope; Adv-3 explicitly out-of-scope.

A v2 candidate would include Adv-3 as in-scope conditional on a runtime-instrumentation feature set and adopt chain-of-challenge protocols to close Ev-4.