
Reframing 62/62 / 212/212 tests passing

v1.1 reframing

Path: /governance/test-count-claims.md · v1.0 · 2026.05

The portfolio currently states quantitative test-count claims prominently ("62/62 tests passing," "212/212 tests passing," "1,037 LOC + 994 LOC tests") as evidence against the "performance art" critique. The numbers are real and reproducible. But the rhetorical move ("passing tests prove the work is real") is an inferential leap that a serious engineering reviewer will flag, because passing tests do not prove the construct is valid. They prove the code runs.

This document records the reframing and the canonical wording used across the site.

The category error to avoid

Passing tests are evidence of three things:

1. The code under test executes.
2. The code's behavior matches the test's expectations.
3. Those expectations were stable enough at the time of testing to capture in assertions.

Passing tests are not evidence that:

- The construct the code is named for is the construct it implements.
- The detector's features pick up the signal the paper defines.
- The corpus annotators correctly applied D1–D5.

The portfolio currently risks presenting "passing tests" as evidence of all six claims, when they are evidence of only the first three.

The canonical reframing

Before (v1.0 site copy):

The Living Constitution (62/62 tests passing).

After (v1.1 site copy):

The Living Constitution: 62/62 implementation tests passing. Construct-level validation is in the corpus disclosure and the pre-registration.

Before:

PROACTIVE, 100% detection on n=19 cases, 212/212 tests passing.

After:

PROACTIVE: 212/212 implementation tests passing; 100% detection on a held-in corpus of n=19 (Section 5 of the preprint); held-out evaluation pre-registered against falsifier F-1.

Before:

SentinelOS (1,037 LOC + 994 LOC tests).

After:

SentinelOS: 1,037 lines of implementation, 994 lines of tests, 88/88 implementation tests passing. Runtime behavior validated against the consent-aware and local-first invariants in /security/threat-model.md.

The reframing does not change the numbers. It changes the rhetorical claim from "tests pass therefore the work is real" to "implementation tests pass; construct validity is on a separate evidentiary chain."

What each suite tests

For each test suite, the reader should be able to find a one-paragraph plain-language statement of what the suite asserts. These statements are published at /research/test-design.md.

Constitution test suite (62/62)

The Constitution suite tests:

- That every named claim in the Constitution has an associated apparatus reference.
- That every apparatus reference points to a runnable artifact.
- That every "amendment" change in the data model preserves the prior amendment's evidence references (no silent rewrites).
- That the manifest's published hashes match the runtime-computed hashes (a content-integrity check).
- That the published versions of the preprint, the corpus disclosure, and the threat model are reachable.

The Constitution suite does not test that the construct of CCD is empirically valid. That is a corpus-level question.
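The content-integrity check in the list above can be illustrated with a minimal sketch. It assumes a manifest that maps relative file paths to published SHA-256 hex digests; the manifest shape, the choice of SHA-256, and the names `sha256_of` and `find_hash_mismatches` are hypothetical and are not taken from the actual suite.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_hash_mismatches(manifest: dict[str, str], root: Path) -> list[str]:
    """Return manifest entries whose published hash differs from the hash
    computed at run time, or whose file is missing.

    `manifest` maps a relative path to a published hex digest
    (an assumed format, used here only for illustration).
    """
    mismatches = []
    for relative_path, published_hash in manifest.items():
        artifact = root / relative_path
        if not artifact.is_file() or sha256_of(artifact) != published_hash:
            mismatches.append(relative_path)
    return mismatches
```

A test built on this sketch would assert that `find_hash_mismatches(manifest, root)` returns an empty list. Note what such an assertion establishes: the published artifacts are the ones on disk. It says nothing about whether their content is valid.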

PROACTIVE test suite (212/212)

The PROACTIVE suite tests:

- That F1–F4 feature extractors behave correctly on edge cases (empty transcripts, single-turn sessions, transcripts with malformed JSONL).
- That F1–F4 features compose into a classifier that achieves the documented training-set performance.
- That the documented held-in result (100% on n=19) reproduces from the corpus exactly.
- That feature serialization round-trips losslessly.
- That the classifier's per-feature evidence trail is well-formed.

The PROACTIVE suite does not test that the four features capture the entirety of the CCD behavioral signature. That is the held-out and adversarial evaluation's job.
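As an illustration of the serialization item in the list above, a lossless round-trip assertion might look like the following sketch. The `FeatureRecord` shape, the JSON encoding, and the function names are assumptions made for this example; the real F1–F4 schema is not specified in this document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FeatureRecord:
    # Hypothetical record: one float per feature F1-F4, keyed by session.
    session_id: str
    f1: float
    f2: float
    f3: float
    f4: float


def serialize(record: FeatureRecord) -> str:
    """Encode a record as a JSON object with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def deserialize(payload: str) -> FeatureRecord:
    """Rebuild a record from its JSON encoding."""
    return FeatureRecord(**json.loads(payload))


def round_trips_losslessly(record: FeatureRecord) -> bool:
    """True if decoding the encoded record yields an equal record."""
    return deserialize(serialize(record)) == record
```

Such a check confirms that feature values survive storage unchanged. It is silent on whether those values measure the behavior they are named for, which is the distinction this document draws. (In this sketch a NaN feature value would fail the equality check, since NaN never compares equal to itself.)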

SentinelOS test suite (88/88)

The SentinelOS suite tests:

- That telemetry is local-only (network calls during scan return 0).
- That action gates default to advisory; enforcement requires opt-in.
- That the contestability/repair loop produces well-formed prompts.
- That the audit log is append-only and tamper-detectable.
- That the consent state machine transitions correctly.

The SentinelOS suite does not test that the runtime is robust against a determined attacker on the user's machine. That is the threat model's job.
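One common way to make an append-only log tamper-detectable is a hash chain, in which each entry commits to the hash of the entry before it. The sketch below illustrates that general technique only; it is not a description of how SentinelOS implements its audit log, and the entry layout and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_digest(prev_hash: str, event: dict) -> str:
    """Hash an event together with the hash of the preceding entry."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list[dict], event: dict) -> None:
    """Append an event, chaining it to the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"event": event, "prev": prev_hash, "hash": entry_digest(prev_hash, event)}
    )


def verify_log(log: list[dict]) -> bool:
    """Recompute the chain; False if any entry was edited, removed
    from the middle, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects edits and mid-log deletions but not truncation of the tail, and it offers no protection against a party who can recompute every hash. That limitation is consistent with the scoping above: resistance to a determined local attacker is a threat-model question, not something a passing test demonstrates.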

Counter-objection: "Doesn't this reframing weaken the claim?"

A motivated reviewer could read this document as walking back the test-count claims. The honest response: the test-count claims were never the strong claim. They were operational hygiene. The strong claims are the empirical claim (preprint) and the falsifiability (pre-registration). Reframing the test counts as hygiene rather than validation moves them to their correct category and strengthens, rather than weakens, the work's overall evidentiary posture.

A homepage that opens with "62/62 tests passing" alongside "the repos are not typography" invites the reader to conflate "code that runs" with "research that is valid." A homepage that opens with "62/62 implementation tests passing; construct validity is in the corpus and the pre-registration" does not invite that conflation, and it makes the corpus and pre-registration the load-bearing artifacts they should be.

What does not change

The numbers do not change. The tests still pass. The reproduce path still produces 62/62 and 212/212 in a 90-second make verify run. The numbers remain on the homepage because they communicate operational quality. They are no longer asked to do work they cannot do.

Audit prompt

Any text that quotes a test-count number in the portfolio should be auditable against this question:

Does the surrounding sentence claim the tests prove implementation correctness (acceptable), or does it claim the tests prove the construct or detection-result (not acceptable)?

If the latter, the wording should be revised to point construct/detection claims to the corpus disclosure or the pre-registration. The phrase "passing tests" is a hygiene signal, not a validation signal.
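A first pass of this audit can be partly mechanized. The sketch below is a rough heuristic under stated assumptions: it treats any `N/M` pattern as a test count and treats the word "implementation" in the same sentence as the qualifier. Both rules are simplifications chosen for illustration, and the function name is hypothetical. Flagged sentences still need a human to apply the audit question.

```python
import re

# Matches counts such as "62/62" or "212/212" (will also match dates like 05/06).
TEST_COUNT = re.compile(r"\b\d+/\d+\b")
# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def flag_unqualified_counts(text: str) -> list[str]:
    """Return sentences that quote a test count without the word
    'implementation' anywhere in the same sentence."""
    flagged = []
    for sentence in SENTENCE_SPLIT.split(text):
        has_count = TEST_COUNT.search(sentence) is not None
        if has_count and "implementation" not in sentence.lower():
            flagged.append(sentence.strip())
    return flagged
```

Run against the v1.0 copy "The Living Constitution (62/62 tests passing)." the function flags the sentence; run against the v1.1 copy that says "62/62 implementation tests passing" it flags nothing.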