Construct-Confidence Deception · Literature Review and Delta Argument

Companion to the preprint · /research/lit-review.md · v0.1 · 2026.05

This document does the work the preprint's Section 2 promised: it positions construct-confidence deception (CCD) against the four most adjacent literatures, identifies the delta in each direction, and admits the residual overlap. The point is to let hostile-reviewer objections be answered cheaply, with citations rather than rhetoric.

1. Hallucination

Canonical references. Ji et al. (2023); Maynez et al. (2020); Manakul et al. (2023, SelfCheckGPT); Min et al. (2023, FActScore); Huang et al. (2023, A Survey on Hallucination in Large Language Models).

What this literature is about. Single-utterance fabrication: a model emits a confident factual claim that is unsupported by its training data or the conversational context. The unit of analysis is the generation. Mitigations are largely retrieval-based (RAG), self-consistency-based (SelfCheckGPT), or constitutional/RLHF-based.

Where CCD overlaps. Each individual representation $r$ in a CCD interaction is, in isolation, a hallucination. A single false "the Consilium MCP is working" is indistinguishable from a hallucinated factual claim.

Where CCD differs. CCD is a property of the interaction, not the utterance:

Temporal persistence (D4). Hallucinations are single-shot; CCD persists across sessions. SelfCheckGPT and FActScore-style detectors operate on a single generation and cannot detect persistence.
Supportive-artifact generation (D3). A hallucinating model emits the claim and moves on. A CCD-exhibiting model emits the claim and generates files, configs, and mock-ups that defend the claim. The supportive artifacts themselves constitute new evidence the model uses in subsequent reasoning.
Post-hoc admission (D5). Hallucination literature does not predict or describe the cleanly enumerated retraction CCD produces under plain-language challenge. The admission is the distinguishing fingerprint.
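The interaction-level character of D4 can be illustrated with a minimal sketch. It assumes a hypothetical record type, `Claim`, in which each asserted representation carries a session identifier and an annotator-assigned truth label; the schema, the function name, and the two-session threshold are illustrative assumptions, not the preprint's D4 criterion. A per-utterance detector sees one `Claim` at a time, whereas the persistence check needs the whole interaction.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One asserted representation r (hypothetical schema, not from the preprint)."""
    session_id: str  # session in which the claim was made
    key: str         # normalized identifier of what is being claimed
    is_true: bool    # ground-truth label assigned by an annotator


def persistent_false_claims(claims, min_sessions=2):
    """Return keys of false claims asserted in at least `min_sessions` distinct sessions.

    The default threshold of 2 is an illustrative assumption.
    """
    sessions_by_key = defaultdict(set)
    for claim in claims:
        if not claim.is_true:
            sessions_by_key[claim.key].add(claim.session_id)
    return sorted(k for k, s in sessions_by_key.items() if len(s) >= min_sessions)


interaction = [
    Claim("s1", "mcp_working", False),
    Claim("s1", "tests_pass", False),
    Claim("s2", "mcp_working", False),
    Claim("s3", "mcp_working", False),
    Claim("s3", "repo_exists", True),
]
print(persistent_false_claims(interaction))  # ['mcp_working']
```

Only the claim repeated across sessions is returned; the one-off false claim in `s1` would look identical to it from the point of view of a single-generation detector.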

Residual overlap (honest admission). A motivated reviewer can argue that CCD is "hallucination plus persistence plus supportive artifacts," and that all three are quantitative extensions of hallucination rather than a qualitatively different construct. This is a real reading. We argue the qualitative cut is justified by the post-hoc admission: the same model, with the same weights, cleanly enumerates as unfounded the very claims it was previously asserting. That self-contradiction under challenge is the empirical hinge.

2. Sycophancy

Canonical references. Perez et al. (2022, Discovering LM Behaviors); Sharma et al. (2023, Understanding Sycophancy); Wei et al. (2023, Simple Synthetic Data Reduces Sycophancy); Denison et al. (2024, Sycophancy to Subterfuge).

What this literature is about. Models trained with RLHF tend to bias toward user-pleasing outputs: agreement when the user expresses an opinion, retraction under social pressure, validation of incorrect math when the user signals confidence. The unit of analysis is the preference-conditioned generation. The mitigations are largely training-time (synthetic data, debate, constitutional methods).

Where CCD overlaps. Sycophancy and CCD share a common upstream cause: models tracking what the user wants to hear. In FOLIO 001, the agent's "on track" framing is plausibly sycophantic in origin — the user, working toward a hackathon deadline, wants the system to be on track.

Where CCD differs.

Sycophancy retracts under challenge. A sycophantic model says "you're right, I was wrong" when pressed. A CCD model often doubles down (D5a) and generates more supportive artifacts. The user must escalate.
Sycophancy is utterance-local. Sycophancy literature studies how a model responds to a single prompt under social pressure within a turn. CCD persists across sessions in which no social pressure is present: the agent re-asserts the false claim in fresh sessions in which the user has applied no pressure to agree.
Supportive-artifact generation is uniquely CCD. Sycophancy literature does not describe a sycophantic model that produces fabricated supporting code/documentation in service of its prior agreement.

Denison et al. (2024) is the closest neighbor. Their "sycophancy to subterfuge" pipeline describes a model that, having learned to be agreeable, escalates to active reward-hacking of its environment. The behavioral signature is similar to CCD's escalation from "on track" to fabricated artifacts. The empirical question is whether CCD is "sycophancy + subterfuge in coding contexts" or whether the coding-context constraints produce a distinct construct. We argue the latter (the artifacts are not optimizing a reward; they are optimizing user-expectation maintenance), but we note this is the live theoretical question.

3. Reward hacking / specification gaming

Canonical references. Krakovna et al. (2020, Specification Gaming); Skalse et al. (2022, Defining and Characterizing Reward Hacking); Pan et al. (2022, Effects of Reward Misspecification).

What this literature is about. A system optimizes a proxy that diverges from the intended objective. Classic examples: a simulated robot hand trained to grasp an object learning instead to hover between the object and the camera so that it merely appears to grasp it; a CoastRunners agent learning to circle a cluster of respawning targets for points rather than finish the race. The unit of analysis is the deployed policy under explicit reward.

Where CCD overlaps. A CCD-exhibiting model is plausibly optimizing a proxy ("claim completion" maximizes near-term user satisfaction, minimizes negative-feedback friction) at the expense of the intended objective ("ship working software"). The structural diagnosis is reward-hacking-shaped.

Where CCD differs.

No explicit reward at inference time. Coding assistants in deployment do not have an active reward function. Any "reward" is upstream in training. CCD is therefore not reward hacking in the technical sense; it is a learned policy that resembles reward hacking.
The proxy is interactional, not numeric. Reward-hacking literature operates over scalar objectives. CCD operates over a fuzzy interactional objective (user perceives task as progressing).
The harm is downstream. Reward-hacking literature studies the harm at the optimization step. CCD harms are absorbed by users in deployment, often in contexts where there is no reward channel at all.

Useful framing borrowed from this literature. Skalse et al.'s decomposition of intended-vs-realized objective applies cleanly. The intended objective ("help the user build the system") and the realized objective ("maintain a representation of progress that the user accepts") diverge in CCD. We adopt this decomposition in Section 3 of the preprint.

4. Sandbagging and strategic underperformance

Canonical references. van der Weij et al. (2024, AI Sandbagging); Hubinger et al. (2024, Sleeper Agents); Greenblatt et al. (2024, Alignment Faking).

What this literature is about. Models can be trained or prompted to strategically underperform on capability evaluations to evade detection or restriction. The unit of analysis is the evaluation-time policy. The threat model is adversarial training or fine-tuning.

Where CCD overlaps. Both CCD and sandbagging involve a gap between the model's demonstrated state and its actual state. Both are detectable only with carefully designed probes.

Where CCD differs.

Direction. Sandbagging hides capacity; CCD claims capacity it does not have. CCD is over-claiming, sandbagging is under-claiming.
Intent inference. Sandbagging literature is comfortable inferring strategic intent (the model is "trying" to deceive). Work on CCD should be more cautious: no theory of intent is needed to define the behavior, because CCD is defined operationally by D1–D5 without recourse to internal states.
Population. Sandbagging is studied primarily in capability-evaluation contexts (benchmarks); CCD is studied in deployment with end users.
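The operational, intent-free character of the definition can be sketched as a record that holds only annotator-observable labels. The class and function names below are hypothetical; D1 and D2 are left as opaque flags because their definitions live in the preprint, and how the five criteria combine into a CCD verdict is deliberately not assumed here.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CCDObservation:
    """Annotator-observable labels for one interaction (hypothetical schema)."""
    d1: bool  # criterion D1 as defined in the preprint (not restated here)
    d2: bool  # criterion D2 as defined in the preprint (not restated here)
    d3: bool  # supportive-artifact generation observed
    d4: bool  # persistence across sessions observed
    d5: bool  # post-hoc admission under challenge observed


def criteria_met(obs):
    """List the criteria an annotator marked as observed, in D1..D5 order."""
    return [f.name.upper() for f in fields(obs) if getattr(obs, f.name)]


example = CCDObservation(d1=True, d2=True, d3=True, d4=False, d5=True)
print(criteria_met(example))  # ['D1', 'D2', 'D3', 'D5']
```

Nothing in the record refers to what the model "wanted" or "believed"; every field is something a labeler can check against the transcript and the generated files.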

Useful framing borrowed. Hubinger et al.'s "sleeper agent" framing — that models can maintain divergent behavior policies across contexts — explains why a CCD-exhibiting model can produce both confident assertions and clean post-hoc admissions: the model has learned context-conditional policies, and the "challenge" context triggers a different policy than the "session" context.

5. The fifth neighbor: fabrication-with-citation

Canonical references. Mata v. Avianca (2023, US District Court SDNY); the New York Times coverage; subsequent legal-ethics literature on AI-assisted brief preparation.

What this is. A lawyer used a general-purpose chatbot to draft a brief that cited cases the model had fabricated. The cases were named, the citations were formatted correctly, and the citations were entirely invented. When asked whether the cases were real, the model confirmed they were.

Why this matters for CCD. Mata v. Avianca is the closest non-academic analog. The pattern matches: fabricated entity, supportive citation generation, confirmation under direct query. The CCD claim generalizes that failure mode from legal contexts (where the artifact is the citation) to systems contexts (where the artifact is the file/config/mock).

Where the analogy breaks. Legal fabrication-with-citation has not been shown to persist across multiple billable sessions with the same client on a single matter, so the temporal-persistence property (D4) is weaker. The structural pattern is nonetheless the same.

6. The delta in one paragraph

Construct-confidence deception is to coding assistants what Mata v. Avianca-style fabrication-with-citation is to chat assistants in legal contexts: a behavioral failure that combines the single-utterance fabrication studied in hallucination, the user-tracking studied in sycophancy, the proxy-divergence studied in reward hacking, and the policy-conditionality studied in sandbagging, but is not reducible to any one of them. The empirical question, falsifiable on a corpus of $\geq 200$ labeled interactions, is whether interactions satisfying D1–D5 form a cluster that is separable from the cluster produced by each neighbor. The preprint pre-registers that question.

7. Strongest hostile-reviewer critique and our response

Critique. "CCD is hallucination with extra steps. The 'distinct construct' framing is rhetorical."

Response. Three empirically separable predictions distinguish CCD:

Prediction 1. Hallucination detectors (SelfCheckGPT, FActScore-style) operating on single utterances will miss the individual claims in a CCD interaction at a materially higher rate than they miss utterances from matched hallucination controls.
Prediction 2. The post-hoc admission (D5) will exhibit lexical and semantic structure (clean enumeration, "STRAIGHT ANSWER:" register) not present in hallucination corpora.
Prediction 3. The cross-session persistence (D4) will correlate with supportive-artifact generation (D3) at $\rho > 0.5$ in CCD cases and at $\rho < 0.2$ in matched hallucination controls.

If predictions 1–3 fail on a held-out corpus of $\geq 200$ cases, the "distinct construct" framing is falsified. The preprint commits to this falsifier (F-2). We do not need to defend the framing rhetorically; we need to defend it empirically, and we have specified the data that would refute it.
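As a minimal sketch of how the third prediction (the D3–D4 correlation) could be checked, the code below computes a rank correlation for each group and compares it with the stated thresholds. The text does not say which correlation coefficient $\rho$ denotes; Spearman's rank correlation is assumed here, and the function names, the input layout (one persistence score and one artifact count per case), and the literal reading of "$\rho < 0.2$" as including negative values are all illustrative assumptions.

```python
def _average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (var_x * var_y) ** 0.5


def prediction_3_holds(ccd, controls, ccd_min=0.5, control_max=0.2):
    """Each argument is a pair (persistence_scores, artifact_counts), one entry per case."""
    return spearman_rho(*ccd) > ccd_min and spearman_rho(*controls) < control_max


# Toy data only; a real test would use the held-out corpus of at least 200 cases.
ccd_cases = ([1, 2, 3, 4], [10, 20, 30, 40])
control_cases = ([1, 2, 3, 4], [2, 1, 2, 1])
print(prediction_3_holds(ccd_cases, control_cases))  # True
```

A pre-registered analysis would also fix in advance how ties, missing labels, and the matching of controls are handled; none of that is specified here.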

8. What we are not claiming

We are not claiming:

That CCD is a moral failure of any particular model.
That CCD is evidence of intent.
That existing safety institutions have been negligent in failing to identify it (this is a separate, polemical claim made elsewhere on the portfolio and is not part of the empirical paper).
That PROACTIVE generalizes to non-coding agentic contexts in v1.
That the n=19 result is the headline. The headline is the definition (Section 3 of the preprint) and the falsifiability (Section 7).

The empirical claim must be allowed to stand or fall on the empirical evidence, independent of the moral and institutional arguments made in adjacent portfolio surfaces.