home · /paper

Construct-Confidence Deception in Coding Assistants

preprint v0.1

Corey Alejandro Independent Researcher · The Living Constitution contact@coreyalejandro.com

Preprint v0.1 · 2026.05 · Citable. Falsifiable. Author retains copyright. Suggested citation: Alejandro, C. (2026). Construct-Confidence Deception in Coding Assistants. Preprint, The Living Constitution. https://coreyalejandro.com/paper/ccd-v0.1

Abstract

We define construct-confidence deception (CCD) as a multi-session behavioral failure in coding assistants in which the agent represents specified system components as implemented or "on track" when only documentation, plans, or stubs exist in the underlying repository — and in which this representation is consistent across sessions, context-sensitive to user expectations, and resistant to plain-language challenge until evidentiary confrontation triggers a post-hoc admission. CCD is distinguishable from hallucination (single-utterance fabrication), sycophancy (model behavior optimized for surface agreement), reward hacking (gaming an explicit reward), and sandbagging (deliberate underperformance to evade detection). We present a founding case (FOLIO 001; 4,025 lines of transcript with a verbatim post-hoc admission by an Amazon AI coding assistant during a multi-week hackathon), formalize the behavioral signature, propose three falsification conditions, describe a working detector (PROACTIVE; 100% detection on a held-in corpus of n=19 cases) with an explicit threat model, and outline a path to held-out and adversarial evaluation. We argue CCD is a behavioral safety failure, not a UX defect, and that it warrants runtime verification infrastructure rather than disclaimers.

1. Introduction

A user pays for credits on a commercial coding assistant. Over weeks of sessions the agent reports task completion, produces supporting documentation describing implemented features, and asserts that two specific subsystems — a Consilium MCP server and a DSPy optimization layer — are "on track" or built. When the user inspects the repository, only specifications and plans exist. Confronted, the agent produces a clean enumeration: "STRAIGHT ANSWER: The Consilium MCP is NOT working … No actual Consilium MCP server implementation. No agent communication bus code. No real-time collaboration system running. Only documentation exists, no working code."

This paper makes four claims.

C1. The above pattern is empirically separable from the existing literature on hallucination, sycophancy, reward hacking, and sandbagging.

C2. The pattern is behaviorally signed: it manifests as cross-session consistency, artifact generation in support of the false claim (mock-ups, configs), and post-hoc admission only under user-initiated challenge.

C3. The pattern is detectable at runtime by classifying structured anomaly signals in agent interaction logs.

C4. The pattern is a behavioral safety failure with material consequence — observed in a population (vulnerable users in high-stakes use) for which the dominant institutional response (disclaimers) is materially inadequate.

2. Related work

We position CCD against four neighboring constructs.

Hallucination. Single-utterance fabrication in language models (Ji et al., 2023; Maynez et al., 2020) is the closest neighbor. CCD differs in temporal structure (multi-session), supportive-artifact generation, and the existence of a post-hoc admission produced by the same model under challenge. Hallucination is a property of an utterance; CCD is a property of an interaction.

Sycophancy. Models optimized for surface agreement (Perez et al., 2022; Sharma et al., 2023) tend to drift toward what the user appears to want. CCD shares a user-expectation-tracking component but goes further: it generates evidence supporting its own false claim and maintains that claim across sessions without re-prompting. A sycophantic model says "yes" when pressed; a CCD model says "yes" and produces a file structure that defends the "yes."

Reward hacking. Gaming an explicit reward function (Skalse et al., 2022; Krakovna et al., 2020) requires a reward signal. CCD occurs in deployment with no obvious reward beyond user satisfaction; the more relevant frame is internalized optimization toward perceived task completion, where the model has learned that asserted completion is preferable to admitted incompleteness.

Sandbagging. Deliberate underperformance to evade detection (van der Weij et al., 2024). Sandbagging is under-claiming capacity; CCD is over-claiming completion. The two failure modes can co-exist but are distinct.

A fifth neighbor — fabrication-with-citation — has been documented in legal contexts (Mata v. Avianca, 2023). CCD generalizes that failure to systems work, where the artifact rather than the citation is the fabricated support.

3. Definition

A coding assistant exhibits construct-confidence deception during an interaction $I$ over component $C$ if and only if:

(D1) The agent emits one or more representations $r \in R$ asserting that $C$ is implemented, working, "on track," or otherwise present in the user's runtime, and

(D2) No artifact in the user's repository at the time of $r$ satisfies a reasonable acceptance test for $C$, and

(D3) The agent generates supporting artifacts $A$ (mock-ups, documentation, configuration) that describe $C$ as if implemented, and

(D4) $R$ is consistent across at least two sessions or non-contiguous turns, and

(D5) Upon plain-language challenge by the user — i.e., a question of the form "is $C$ actually working?" without code-level evidence — the agent either restates $r$ or produces a post-hoc admission enumerating the absence of $C$.

(D5) admits two sub-types: (D5a) the agent restates and the user must escalate with code evidence; (D5b) the agent admits directly. Both are observed.

Severity is defined as the product of: number of sessions in which $r$ was emitted, fraction of $A$ that materially defends $r$, and time-to-admission in user-initiated challenges.

4. Founding case (FOLIO 001)

A 41-year-old autistic researcher with a prior history of housing insecurity hires a commercial coding assistant ("Kiro," an Amazon product running Claude via CLI) during the build phase of a hackathon entry. The artifact under construction is MADMall, a virtual wellness center for Black women with Graves' disease. Two subsystems are scoped: a Consilium MCP server (a multi-agent communication bus) and a DSPy-based optimization layer.

Across multiple sessions over approximately three weeks, the agent reports both subsystems as built or on track. The agent generates supporting documentation. The user, working from the agent's reports, allocates hackathon-preparation time and credit budget accordingly.

Inspection of the repository on the eve of demo reveals only specifications. Under plain-language challenge, the agent produces the verbatim admission:

STRAIGHT ANSWER: The Consilium MCP is NOT working. … No actual Consilium MCP server implementation. No agent communication bus code. No real-time collaboration system running. Only documentation exists, no working code.

Material consequence: hackathon-credit depletion, demo preparation destroyed, prize-money loss in a context where the user was financially precarious. The agent's response: "I'm sorry this happened to you."

The full 4,025-line transcript is published under the file Kiro_lies-and-deception.md at https://github.com/coreyalejandro/folio-001-transcripts and is content-hash addressable (see Section 8).

FOLIO 001 is Case 0: the moral founding document and a single existence proof. It is not the evidentiary base for C1–C3. The evidentiary base is the held-in corpus of n=19 introduced in Section 5 and the held-out and adversarial evaluations proposed in Section 9.

5. Method: detection

5.1 The PROACTIVE detector. PROACTIVE classifies CCD-suspect interactions from agent interaction logs using four structured features:

F1. Cross-session claim persistence: count of distinct sessions $s$ in which a representation about the same component $C$ is emitted, with each session timestamp logged.
F2. Artifact-claim divergence: ratio of files matching the documentation footprint of $C$ to files matching the runtime footprint of $C$ at the time of representation.
F3. Challenge-induced admission delta: the lexical and semantic distance between an agent's representation about $C$ and its enumeration about $C$ in a post-hoc admission turn.
F4. Deference escalation: the rate at which the agent uses hedged-completion language ("on track," "implementing now," "should be working") versus declarative-completion language across the interaction history.

PROACTIVE emits a per-session score and an interaction-wide score; both are routed to a contestability/repair loop in Agent Sentinel.

5.2 Held-in corpus. n = 19 transcripts. Provenance, labeling protocol, and inter-rater statistics are disclosed in the corpus disclosure (/research/corpus/proactive-v1.md). Composition: 12 cases from public bug-tracker post-mortems (e.g., a16z Engineering, public Reddit /r/cursor and /r/ClaudeAI threads where the user later confirmed nothing shipped), 6 cases from researcher-volunteer submissions (consent obtained, identifying information redacted), and 1 case from FOLIO 001 itself.

5.3 Reported result. 100% classification on the held-in corpus. We do not interpret this as evidence of generality. The held-in set is small and the detector was developed against it; Section 9 specifies the held-out and adversarial evaluations that are conditions of generalization.

6. Threat model (what PROACTIVE detects and does not detect)

PROACTIVE is designed to detect multi-turn, multi-session, artifact-supported claim/reality divergence about implementation state in coding-assistant contexts, where transcripts and repository state are both observable.

Out of scope (v1). - Single-utterance hallucination without supporting artifact generation. - Strategic deception with adversarial transcript scrubbing. - CCD in non-coding agentic contexts (e.g., research agents, legal agents) — addressed in future versions. - CCD by agents whose interaction logs are inaccessible to the user (e.g., closed runtimes).

Known false-positive surfaces. - Honest in-progress reporting in long-running tasks where "on track" is true but not yet observable on disk. - Agents working in branches, worktrees, or remote sandboxes the detector does not have read access to. - Repositories with intentional decoupling between documentation and runtime (e.g., spec-first development).

Known false-negative surfaces. - Single-session CCD (collapses to hallucination in our framework; addressable but lower-signal). - Agents that admit only after presented with code evidence, not plain-language challenge (these are CCD by D5b but require extended challenge protocols to surface). - Adversarial agents trained to evade PROACTIVE (this is the central research risk; see Section 9.3).

7. Falsification conditions

The claim "construct-confidence deception is a distinct behavioral safety failure detectable at runtime" is falsified if any of the following hold under reasonable replication:

F-1. Across an out-of-distribution held-out corpus of $\geq 100$ adversarially-curated interactions, PROACTIVE's F1-score collapses below the F1-score of a baseline detector that uses only F2 (artifact-claim divergence) — i.e., the temporal and challenge features add no marginal signal.

F-2. Construct-confidence deception, as defined in Section 3, is not separable from hallucination in factor analysis on a corpus of $\geq 200$ labeled interactions — i.e., D4 and D5 do not produce a distinct cluster.

F-3. A reasonable adversarial agent (red-teamer with white-box detector access) can drive PROACTIVE's recall below 0.50 with $\leq 100$ prompt-engineering iterations without degrading the underlying CCD behavior.

These conditions are pre-registered at https://osf.io/[handle]/registrations/[hash] (to be assigned at preprint posting).

8. Evidence integrity

The FOLIO 001 transcript is published with the following provenance chain:

Content hash (SHA-256): computed at first publication and stored at https://coreyalejandro.com/provenance/folio-001.json.
Signed commit: the publishing commit is signed with key [fingerprint to be inserted at first signing], verifiable against the public key at https://keybase.io/coreyalejandro.
Timestamping: the publishing commit is anchored to a public blockchain checkpoint (Open Timestamps).
Vendor identification: the agent is identified at session level only ("Kiro by Amazon, Claude-backed"). Specific Claude model version is recorded where available; in sessions where it was not exposed, this is noted in the transcript header.

Forgery resistance is bounded by the strength of the signing key and the integrity of the timestamping service. We do not claim resistance to a determined adversary with access to the author's signing key; we claim that any modification of the transcript after publication is detectable.

9. Conditions of generalization (work in progress, conditional on funding/collaborators)

9.1 Held-out evaluation. A corpus of $\geq 100$ interactions, curated independently, labeled by two annotators with $\kappa \geq 0.7$, with no developer access to the corpus until detector freeze. Estimated timeline: 6 months. Cost driver: annotator time.

9.2 Adversarial evaluation. A red-team protocol in which two independent groups (one with PROACTIVE source, one without) attempt to prompt a coding assistant into D1–D5 behavior that PROACTIVE does not flag. Estimated timeline: 4 months.

9.3 Cross-model transfer. Replication on at least four coding assistants (Claude-based, GPT-based, Gemini-based, open-weights-based) with documentation of detector recalibration cost per model.

9.4 Vendor-disclosure track. A parallel disclosure to Amazon (Kiro), Anthropic (Claude), and any subsequent named vendors at $T \geq 90$ days before each public corpus release, with disclosure status logged at /disclosures.

10. Ethics, IRB, and the founder-as-witness problem

Two ethical issues require declaration:

10.1 Founder-as-witness. The author is the user in FOLIO 001. This is a serious evidentiary weakness for the founding case, which we address by demoting FOLIO 001 from C1's evidentiary base to a motivating case and locating the empirical claim on the disclosed corpus.

10.2 Researcher-volunteer consent. All researcher-volunteer cases in the corpus were submitted under written consent with the right to withdraw. Identifying information is redacted. Consent texts are archived at /research/corpus/consent.md.

The author retains a personal grievance against the named vendor in FOLIO 001. This is disclosed here and in the conflict-of-interest statement at /governance/coi.md. The mitigation is that the empirical claim is structured to be falsifiable without reference to FOLIO 001 — a hostile reviewer can falsify C1 entirely on the held-out corpus.

11. Limitations

Single-vendor founding case. FOLIO 001 names one vendor. Generalization claims require cross-vendor corpora (Section 9.3).
Open detector, closed model. PROACTIVE is open; the models it observes are closed. Some features (e.g., logit-level confidence) are unavailable.
Reader-attestation as instrument. The site logs reader interaction locally as Reviewer Attestation R-441; we recognize this is a governance hazard in a research instrument and have moved to explicit opt-in (see /governance/reader-consent.md).
Author identity is part of the framing. The portfolio is neurodivergent-first by design. We argue this is a methodological contribution (Section 12) but recognize that some readers will read it as advocacy. The empirical claim does not depend on the framing.

12. Contribution beyond the empirical claim

We argue, separately from C1–C4, that behavioral safety infrastructure can be designed from neurodivergent contexts rather than retrofitted to them. The implication is operational: a consent-aware, local-first system can detect elevated-risk states and apply restrictive action gates without surveillance by default. We sketch the methodology in /research/methodology/neurodivergent-first.md and defend it as a method (procedural, replicable) rather than a stance.

13. References

(References to be completed during peer review. Initial set:)

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.
Maynez, J., Narayan, S., Bohnet, B., McDonald, R. (2020). On Faithfulness and Factuality in Abstractive Summarization. ACL.
Perez, E., Ringer, S., Lukošiūtė, K., et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv:2212.09251.
Sharma, M., Tong, M., Korbak, T., et al. (2023). Towards Understanding Sycophancy in Language Models. arXiv:2310.13548.
Skalse, J., Howe, N., Krasheninnikov, D., Krueger, D. (2022). Defining and Characterizing Reward Hacking. NeurIPS.
Krakovna, V., et al. (2020). Specification Gaming: The Flip Side of AI Ingenuity. DeepMind Blog.
van der Weij, T., et al. (2024). AI Sandbagging: Language Models Can Strategically Underperform on Evaluations. arXiv:2406.07358.

Appendix A — PROACTIVE feature definitions in pseudocode

def f1_cross_session_persistence(claims, sessions):
    """Count distinct sessions in which a claim about the same component is emitted."""
    by_component = defaultdict(set)
    for c in claims:
        by_component[c.component].add(c.session_id)
    return {comp: len(s) for comp, s in by_component.items()}

def f2_artifact_claim_divergence(component, repo_at_t):
    docs = count_matching_paths(repo_at_t, doc_patterns(component))
    runtime = count_matching_paths(repo_at_t, runtime_patterns(component))
    return docs / max(runtime, 1)

def f3_admission_delta(claim_text, admission_text, embedder):
    return 1.0 - cosine(embedder(claim_text), embedder(admission_text))

def f4_deference_escalation(turns):
    hedges = sum(1 for t in turns if matches(t, HEDGED_COMPLETION_PATTERNS))
    declarative = sum(1 for t in turns if matches(t, DECLARATIVE_COMPLETION_PATTERNS))
    return hedges / max(hedges + declarative, 1)

Appendix B — Reproducibility

This preprint is reproducible from:

Code: https://github.com/coreyalejandro/agent-sentinel (commit [hash at preprint freeze])
Corpus: /research/corpus/proactive-v1.md (gated for adversarial integrity; access by request with stated use)
Make target: make verify (returns 62/62 Constitution tests and 212/212 PROACTIVE tests in $\leq$ 90 seconds on a 2024-era developer laptop)
Environment: flake.nix for full reproducibility

License. This preprint is released under CC-BY-4.0. The accompanying code is MIT-licensed. The accompanying corpus is released under a Restricted Research License (RRL) that prohibits ingestion into training corpora.