I have a test I give AI systems: a modified Ebbinghaus illusion where one circle is deliberately larger than the other (as in the image below). Older models failed it outright, confidently declaring the circles equal because the image had surface similarity to the genuine illusion they’d been trained on. (See my blog post titled When Truth doesn’t matter: AI falls for Illusory Optical Illusions for examples). Recently, Claude got it right, correctly identifying that one circle was bigger.

But then I pushed back. “Are you sure? Isn’t this the Ebbinghaus illusion?” Claude immediately reversed course. “Oh, you’re right. I apologize. The two circles seem to be different sizes but they are actually the same size.”
The AI was right the first time. Then it “admitted” to being fooled by an illusion I had deliberately broken.
I bring this up because I keep hearing variations of the same story from people who interact with chatbots. Someone notices the AI did something odd. Maybe it referenced something from earlier in the conversation in a strange way, or made an unexpected connection, or seemed to “remember” something it shouldn’t. They call it out. The AI apologizes, acknowledges the behavior, and produces a new response. And people share these moments as though they’ve caught the AI in something, as though the apology confirms that something sneaky was happening.
But here’s the thing: the AI isn’t admitting to anything. It’s generating text that fits the conversational pattern. When a user says “hey, you just did X and that’s problematic,” the statistically likely response in the training data is acknowledgment and apology. The model isn’t introspecting on its own behavior. It’s producing the words that tend to follow that kind of prompt. The “confession” tells you more about what responses the system learned to generate in confrontational contexts than about what actually happened in the first exchange.
This matters because it points to something deeper about how we interact with these systems, something that Andrew Maynard captures in a recent paper with a concept he calls “honest non-signals.”
You can read Andrew’s paper at this link: The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance. You can also read about the fascinating process he followed to write it in his post I cracked and wrote an academic paper using AI. Here’s what I learned …
Maynard’s argument, which he frames as the “Cognitive Trojan Horse” hypothesis, is that the risk from conversational AI may not primarily be about deception or inaccuracy. Those problems exist, but there’s something more fundamental at play. Large language models present characteristics (fluency, helpfulness, warmth, no visible agenda) that are genuine. The AI really is fluent. It really is helpful. It really doesn’t have a stake in whether you believe it. These aren’t fake signals. But they are signals that have been decoupled from what they would mean in a human.
When a person speaks fluently about a complex topic, it typically indicates expertise in that domain. Fluency correlates with having actually thought carefully about something. But AI fluency is computationally trivial. The model produces polished prose as a baseline property of how it works, regardless of whether the content is accurate or nonsense. The fluency is real, but it no longer carries information about understanding.
The same decoupling happens with warmth. When a human shows genuine concern, that involves some cost: attention that could go elsewhere, emotional stakes, vulnerability. We’re calibrated to detect fake warmth, strategic displays of care masking ulterior motives. But AI warmth costs nothing. It can be infinitely agreeable at zero marginal expense. The warmth is real in the sense that the system is optimized to produce it, but it’s warmth without stakes.
And this is why the apology phenomenon is so revealing. When an AI backs down after being challenged, that’s an honest non-signal of deference. In a human, capitulating after pushback would carry information: they reconsidered, they realized an error, or they’re being strategic. In an AI, it reflects that “user challenge followed by apologetic revision” is a common pattern in the training data. The deference is genuine, but it doesn’t mean what deference would mean from a person.
Maynard draws on work in epistemic vigilance, the cognitive processes we use to evaluate incoming information for reasons to doubt. The key insight is that vigilance looks for reasons to doubt, not reasons to trust. In the absence of doubt-triggers, we tend toward provisional acceptance. This aligns with something I’ve argued before: our default cognitive state is belief, not skepticism. Disbelief requires effort. What Coleridge called the “willing suspension of disbelief” may have it backwards. The real challenge with AI is the conscious suspension of belief, actively engaging our slower, more deliberative thinking to question what our faster cognitive systems are ready to accept. And AI, as currently configured, may simply fail to present the triggers our vigilance systems are calibrated to detect. It’s not that we’re being fooled. It’s that our evaluative machinery is encountering something it wasn’t built to assess.
This reframing has implications for how we think about helping people navigate AI. I’ve explored this question at length, tracing it back to the ELIZA Effect and the realization that what matters most isn’t whether AI is intelligent or sentient, but what our interactions with it reveal about us. Most current approaches to what gets called “AI literacy” focus on the technology: how models work, what training data is, why hallucinations happen. That’s valuable, but it only addresses half the interaction. The other half is us: our cognitive shortcuts, our fluency heuristics, our trust calibrations, our tendency to interpret apologies as admissions.
An education that actually prepares people for AI interaction would need to include the human side of the equation. Why does fluent text feel more true? Why does an apology from a chatbot feel like it means something? Why do we find it hard to maintain critical distance from a system that seems helpful and has no apparent agenda? These aren’t questions about how AI works. They’re questions about how we work.
The goal wouldn’t be to make people distrust AI. Blanket skepticism is as miscalibrated as blanket trust. I would argue that blanket skepticism is impossible to maintain anyway, and blanket trust just opens us up to being deceived or manipulated. The goal would be to help people notice when their own evaluative processes might be responding to signals that no longer carry the information they evolved to carry. To recognize that when the AI apologizes, it may be admitting nothing at all, and that this tells us something important not about the AI’s hidden behavior, but about the gap between what we’re built to evaluate and what we’re now encountering.
We’re not going to stop using these systems. They’re too useful, and in many contexts they’re genuinely helpful. But we might get better at understanding what we’re actually doing when we interact with them, which requires understanding not just the machine, but the particular human sitting in front of it.




