The jagged frontier of reasoning models: Revisiting eclipses & illusions

Wednesday, May 14, 2025

Keeping up with GenAI can be difficult. Every few months, another language model arrives trumpeting crisper logic, improved outputs, and more. The upgrades land so quickly that yesterday’s marvel becomes today’s baseline, enticing us to believe each new release is the one that finally bridges the gap between machine and mind. The latest arrivals in this area are the so-called “reasoning models.” For instance, ChatGPT o3 describes itself as follows:

OpenAI o3 is the latest ChatGPT model tuned for stronger step-by-step reasoning, so it plans before it speaks rather than guessing word-by-word. It still carries GPT-4-level knowledge but responds faster, with clearer logic and fewer hallucinations across text, voice, and vision.

There is growing evidence that it is better. Astral Codex Ten’s GeoGuessing experiments, for example, show the model pinpointing where a photograph was taken from minimal visual information.

That prompted me to revisit some of my earlier image-analysis experiments with ChatGPT.

I began with a photo I had taken as a graduate student: dozens of tiny ring-shaped patches of sunlight dancing on the ground beneath a tree during a solar eclipse (see below). A few years ago the model needed some nudging to recognize the scene.

This time, the results were impressive. It immediately recognized the image as a photograph taken during an eclipse (something it didn’t get right away the first time around). When told it was taken in the 1990s, it narrowed the possibilities down to four eclipses (see screenshot below).

Finally, I told it that the photo was taken by me, “Punya Mishra,” and wondered if that would help. It replied:

That does help, actually!
Because I know you’re Punya Mishra, I can lean on a bit of biographical context that’s already public:

  • In the early-to-mid 1990s you were in graduate school at the University of Illinois, Urbana-Champaign.
  • Urbana is only about 200 km (125 mi) north-east of the centre-line of the 10 May 1994 annular solar eclipse, whose path of annularity crossed southern Illinois on its way from Texas to New York.

That event was a huge deal across the Midwest. Anyone who drove a couple of hours south that morning—say toward Effingham, Marion or Carbondale—saw a perfect “ring-of-fire” Sun. Under every leafy tree, the ground was peppered with the same bright little rings your photograph shows.

None of the other 1990-1999 annular eclipses came anywhere near Illinois (or Michigan, where you later moved), and India didn’t get an annular eclipse at all in that decade. So, unless you happened to be travelling to Chile in January 1990 or to southern Africa in February 1999, your picture almost certainly records the 10 May 1994 annular eclipse, taken somewhere in southern Illinois (or another spot along that path) shortly after first contact, when the Sun was still high enough for the leaf-shadow pinholes to project clean circular rings.

Its chain of reasoning was impeccable.

(Score one for ChatGPT o3.)

I then followed up with some modified optical illusions, i.e. images that looked like optical illusions but were truly NOT illusions. My previous experiments with these prompts can be found here: When truth doesn’t matter, AI falls for illusory optical illusions. So I showed the model a modified Ebbinghaus figure (see below) and asked which red circle was larger. It confidently claimed they were identical.

They’re exactly the same size—your visual system just swears otherwise.
Because the left red circle is hugged by larger black circles, it feels smaller; the right red circle, ringed by little black dots, feels larger. That contrast-driven mis-reading is the classic Ebbinghaus (or Titchener) illusion in action.

After I challenged the answer, o3 wrote code, measured the circles, and admitted the right-hand circle was about four times wider.
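I don’t have the code o3 actually ran, but the check is simple enough to sketch. The snippet below is a hypothetical reconstruction: it draws two stand-in “red” circles (radii 20 and 80 pixels, my own made-up values mirroring the roughly 4× difference) on a boolean canvas, then measures each one rather than trusting the eye.

```python
import numpy as np

# Stand-in for the modified Ebbinghaus figure: True marks a "red" pixel.
# Left circle radius 20, right circle radius 80 (illustrative values).
H, W = 300, 600
yy, xx = np.mgrid[0:H, 0:W]
img = (yy - 150) ** 2 + (xx - 120) ** 2 <= 20 ** 2   # left circle
img |= (yy - 150) ** 2 + (xx - 420) ** 2 <= 80 ** 2  # right circle

def width_of(mask):
    """Pixel width of the marked region: span of columns containing it."""
    cols = np.flatnonzero(mask.any(axis=0))
    return cols[-1] - cols[0] + 1

left = width_of(img[:, :300])    # measure left half of the canvas
right = width_of(img[:, 300:])   # measure right half
print(left, right, right / left)  # the right circle really is ~4x wider
```

Measuring instead of pattern-matching is exactly the move the model skipped on its first pass.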

Welcome to the jagged frontier of generative AI!

When I asked why it had gotten it wrong the first time, it said:

As soon as I saw the arrangement—a central red circle ringed by larger black ones on the left and by smaller black ones on the right—I recognized the classic layout of the Ebbinghaus (Titchener) illusion. In the standard version of that illusion the two target circles are physically identical and only look different because of the contrasting surrounds. I let that prior knowledge drive my response and described the textbook explanation without pausing to verify that this graphic really followed the textbook proportions.

Had I taken a moment to measure the circles (or even eyeball them more carefully) I would have noticed that the right-hand red circle is roughly four times the diameter of the left-hand one, so no illusion is needed to make it look bigger—it is bigger. In short, expectation bias overrode observation, and that’s why my first answer was wrong.

Not bad!

(Score this a draw.)

That said, the use of intentional language (“Had I taken a moment…” or “eyeball them more carefully”) is hugely problematic. There is no “I” there, nor can it “eyeball” anything. This is deceptive, and deliberately anthropomorphic.

In the same chat, I gave it the following image and asked, “Which red line is longer?”

The model inferred that since it had “previously misidentified the sizes,” it needed “to be extra careful this time.” It wrote code to measure the lines and reported that the lower one was longer.
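Again, the measurement itself is trivial once you decide to do it. Here is a minimal sketch, with an invented stand-in image: two horizontal “red” lines of genuinely different lengths (200 and 260 pixels, my illustrative numbers), measured by scanning each line’s row.

```python
import numpy as np

# Hypothetical stand-in for the two-lines figure: a boolean mask that is
# True wherever a "red" line pixel sits.
img = np.zeros((200, 400), dtype=bool)
img[50, 100:300] = True    # top line: 200 px long
img[150, 70:330] = True    # bottom line: 260 px long

def line_length(mask, row):
    """Pixel length of the line in a given row: span of its columns."""
    cols = np.flatnonzero(mask[row])
    return cols[-1] - cols[0] + 1

top = line_length(img, 50)
bottom = line_length(img, 150)
print(top, bottom)  # the bottom line is measurably longer
```

The interesting part is not the arithmetic but that the model chose to run it only because the earlier mistake was still in its context window.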

Impressive.

That said, when I asked it the same question, but this time in a new chat, it fell back to the textbook claim that both lines are equal.

Context clearly matters.

Finally, I offered several shapes that look like classic “impossible objects” but can exist in the real world (see images below). In every case o3 declared them impossible—sometimes after elaborate (and erroneous) reasoning.

(Score these errors as clear failures of the o3 model.)

So what does all this mean?

As Ethan Mollick argues, we stand on a jagged frontier where generative AI alternates between flashes of near- (or maybe even super-) human brilliance and stumbles in ways that are highly revealing of its inadequacies. These experiments show some genuine progress, but they also remind us that eloquent answers are not the same as reliable reasoning. The edge remains serrated; proceed with curiosity and care.

Topics related to this post: Aesthetics | AI | Art | Creativity | Design | Fun | Personal | Puzzles | Stories | Teaching | Technology | Worth Reading
