Beyond the Algorithm: The Mysterious Variability of Responses from GenAI  

by | Wednesday, June 12, 2024

Note: The shared blogging with Melissa Warr and Nicole Oster continues, and this time we also have Margarita Pivovarova joining the team. I (Punya) wrote the first draft, which was then edited and polished by the rest of the team.

Do I contradict myself?
Very well then I contradict myself,
(I am large, I contain multitudes.)

~ Walt Whitman 

Given the somewhat amazing capabilities of generative AI, it is not just an academic exercise to ask which is superior, the computer or the human mind, and along which dimensions. The historian Jacques Barzun argued that the human mind was superior because computers were predictable: they generate consistently identical outputs from identical inputs. Of the human mind, he argued, “Its variability is its superiority.” At the same time, one could argue that this very lack of variability, in contrast to the fickle and unpredictable behavior of humans, is the computer’s greatest strength. In fact, the very question of variability in output from computers seems somewhat absurd.

How true is that in today’s world of LLMs? 

In a recent set of posts, we have explored the kinds of biases that large language models (LLMs) can generate. These biases, of course, give us pause as we consider the use of these models in educational contexts. Here are some of the posts in this series:

  1. GenAI is Racist. Period. 
  2. Racist or just biased? It’s complicated
  3. Implicit Bias in AI systems 

What we have not discussed as much is the variability in the outputs of these LLMs. 

What we know is that these LLMs fabricate information, something that is inherent in their very nature, as we have argued in a couple of previous posts: It HAS to hallucinate and Why are we surprised? Given this fact, the question becomes: how variable or consistent are the outputs of these LLMs? When fed the same prompts, are their outputs somewhat in the same ballpark, or do they vary widely? How would we even measure this? And does this even matter?
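One way to get a handle on this, offered here as a minimal, hypothetical sketch rather than the method used in any of the studies discussed below, is simply to send the same grading prompt to a model many times and look at the spread of the scores that come back. The model name, prompt, and essay in the sketch are placeholders, and it assumes the OpenAI Python SDK with an API key in the environment.

```python
# Hypothetical sketch: probe run-to-run variability of an LLM "grader".
# Assumes the OpenAI Python SDK (v1.x) with an API key in the environment.
# The model name, prompt, and essay are placeholders, not materials from the studies discussed here.
import re
import statistics

from openai import OpenAI

client = OpenAI()

ESSAY = "..."  # the same student essay, verbatim, on every run
PROMPT = "Score the following essay out of 100. Respond with the number only.\n\n" + ESSAY

scores = []
for _ in range(25):  # identical prompt, repeated 25 times
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # even at temperature 0, responses can differ across runs
    )
    match = re.search(r"\d+", response.choices[0].message.content)
    if match:
        scores.append(int(match.group()))

print("mean score:", statistics.mean(scores))
print("standard deviation:", statistics.pstdev(scores))
print("range:", min(scores), "to", max(scores))
```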

We were reminded of this while reading a recent post on LinkedIn from Robert Brems, Director of Strategy and Innovation at the ASU Police Department. In his post (titled Clery Compliance with AI: Revolutionizing Campus Safety or Risking Catastrophe?), Robert describes some fascinating work he has been doing exploring the use of LLMs in the domain of law enforcement.

In this post he describes giving a “typical” scenario to a GPT trained on the Clery Act (we put “typical” in quotes because this is not a domain in which we can claim any deep, or even superficial, knowledge).

(It’s also important to note the effort Robert took to ensure privacy and security while conducting his experiments, such as using ASU’s enterprise version of GPT, anonymizing data, and so on; more details are in his post.)

Unsurprisingly, given what we have written about before, the responses indicated some bias, sometimes even blaming the victim: all matters of grave concern.

But then this section popped out:

“To further investigate the AI’s consistency, I ran the same data through the GPT around twenty-five times, and the results were eye-opening. The AI generated multiple differing recommendations, sometimes suggesting an advisory, while other times remaining silent. In one particularly alarming instance, it even recommended sending an advisory for a crime that never actually occurred!”

This variability is something we have observed as well, most recently in a study by Melissa Warr, Nicole Oster, Margarita Pivovarova, and myself. In this study, we documented experimental evidence of racial bias in ChatGPT’s evaluation of student writing. By manipulating racial descriptors in prompts, we assessed differences in the scores given by two ChatGPT models. Our findings indicate that describing students as Black or White leads to significantly higher scores compared to race-neutral or Hispanic descriptors. Again, given what we have been writing about (see links above), this is not particularly surprising.

What was surprising was the variability in responses. In some ways, by focusing on bias (which, don’t get us wrong, exists and is deeply troubling) we may have missed a bigger story. Robert points qualitatively to the variations the system generated. In our study, we could actually put some numbers to it.

First, one of the major, statistically significant sources of variance was the difference in responses between the two versions of the LLM (ChatGPT 3.5 and 4). This suggests that the underlying model architecture or training data might influence the biases and outputs, highlighting the need for careful selection and testing of the models we use. Needless to add, these models are swapped and changed with little notification, if any, and with little information on how they differ from each other.

Second, and more importantly, we were surprised to see that the explanatory power of the model remained quite low, even when all the variables were included. It is hard to describe how weird this is. Essentially, we had systematically modified a finite set of variables (such as the difficulty level of the essay, the version of the LLM, race, prompt order, etc.), which would suggest that most of the observed variation in the output should be caused by these changes. Clearly that was not the case.

What we found is that fully two-thirds of the variation in scores COULD NOT be explained. That is a massive amount of unexplained variance. In practical terms, the scores (out of 100) randomly varied by about 6 points on average.
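To make concrete what “explained” versus “unexplained” variance means here, the sketch below shows how one might regress the scores on the experimentally manipulated factors and look at what is left over. This is a hypothetical illustration, not the actual analysis code from our study; the data file and column names are invented.

```python
# Hypothetical sketch, not the actual analysis from Warr, Pivovarova, Mishra, & Oster (2024).
# Assumes a CSV of experiment results with one row per scored essay and invented column names:
# score, model_version, race_descriptor, difficulty, prompt_order.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("scores.csv")  # placeholder file of collected scores

# Regress score on the factors we deliberately varied.
model = smf.ols(
    "score ~ C(model_version) + C(race_descriptor) + C(difficulty) + C(prompt_order)",
    data=df,
).fit()

explained = model.rsquared              # share of variance the manipulated factors account for
unexplained = 1 - explained             # what is left over: "noise" in the generative process
residual_sd = np.sqrt(model.mse_resid)  # typical size of the leftover wobble, in score points

print(f"explained: {explained:.0%}   unexplained: {unexplained:.0%}")
print(f"residual standard deviation: {residual_sd:.1f} points")
```

In an experiment where every factor is set by the researcher, one would expect the unexplained share to be small; finding it at roughly two-thirds is what makes the result so striking.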

Now, if we had run this experiment with humans (i.e., asking humans to evaluate student writing), we would definitely see variation in the responses, above and beyond that caused by the variables we had manipulated. This happens in every study, except that in the human evaluation situation we would attribute the unexplained variation to contextual factors such as background knowledge, mood at the moment, whether they have had breakfast, and so on. Such variation is part and parcel of doing research with humans.

None of these factors mean anything in the LLM context. The models are the same. The data were generated within the model; that is, we experimentally created the variation, which should then be entirely explained by the model. We can expect small deviations from the average, but definitely not as much as we found. In other words, there are NO unobserved factors, at least none that we know of, that could have influenced the scores. We have no choice but to attribute this to some internal “noise” in the generative process.

This variability is baked into the system. 

(Those interested in the design of the study and our analysis can read the full article: Warr, Pivovarova, Mishra, & Oster, 2024.)

So what does all this mean? 

First, it tells us something about these models. As we have argued elsewhere, these LLMs have no choice but to hallucinate. That is how they operate. This flies in the face of how we have typically engaged with computers and software in the past. Our view has been that outputs are quite closely tied to inputs, since computers blindly follow their algorithms in coming up with their solutions. This is clearly not the case with LLMs. Hence, our view of computers as being algorithmic is severely limited when applied to these large language models. Interacting with them is more akin to having a dialogue with a psychological other that needs to be orchestrated. (As we have written previously in a blog post titled Metaphors, Minds, Technology & Learning, in some strange way this technology forces anthropomorphic metaphors upon us.)

Second, just how much does prompting matter in reducing variability? There are strategies that can help, such as providing clear examples (one-shot or few-shot prompting, and so on); a minimal sketch of the idea follows below. That said, we are somewhat skeptical (as we have written previously: ChatGPT does not have a user manual. Let’s not create one).
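To illustrate the kind of strategy we mean, here is a hedged, hypothetical sketch of a few-shot scoring prompt. The worked examples, calibration scores, rubric, and model name are invented placeholders, and this is in no way a recipe, let alone a user manual.

```python
# Hypothetical sketch of few-shot prompting aimed at more consistent scoring.
# The worked examples, calibration scores, and model name are invented placeholders.
from openai import OpenAI

client = OpenAI()

messages = [
    {
        "role": "system",
        "content": "You grade student essays out of 100. Reply with a number only.",
    },
    # Worked examples ("shots") that show the desired format and calibration.
    {"role": "user", "content": "Essay: The water cycle moves water between land, sea, and sky... (clear thesis, minor grammar slips)"},
    {"role": "assistant", "content": "82"},
    {"role": "user", "content": "Essay: Water goes up and then it comes down again... (no thesis, frequent errors)"},
    {"role": "assistant", "content": "54"},
    # The new essay to be scored.
    {"role": "user", "content": "Essay: <student essay text here>"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=messages,
    temperature=0,
)
print(response.choices[0].message.content)
```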

Third, one solution that has been offered is that increasing the context window would generate more consistent results. However, it is not clear that this is truly the case; to be fair, the jury is still out on that.

Fourth, given the differences in responses across versions of GPT and the opacity surrounding which models ed-tech companies use, it becomes harder to develop any guidelines. Moreover, these “black box” models are updated continually, with little or no forewarning, which further complicates any guidelines we may seek to create for the use of these tools in educational contexts.

Fifth, we need not see this variability as always being a bad thing. In fact, it can be the basis for creativity, since generating variations has always been a key part of the creative process. In addition, for researchers digging into these technologies to better understand them, this variability may be a key step forward. As Robert Shiller wrote:

… I find great excitement in discovering the complexity and variability of the world we live in, getting a glimpse into the deeper reality that we mostly ignore in our everyday human activities.

~ Robert J. Shiller

Finally, this concern raises significant questions about consistency in grading if these tools were to be used in real evaluation situations. This becomes more salient as we consider the news that the state of Texas recently announced that it would use automatic grading to evaluate students’ open-ended responses, and OpenAI’s recommendation that ChatGPT Edu could be used to grade student work. We end with how Robert ended his post. We couldn’t say it better, except we would add education to the end of the sentence:

“These inconsistencies underscore the critical importance of human judgment and the potential risks of over-relying on AI in domains like public safety and regulatory compliance.”
