Beyond the Algorithm: The Mysterious Variability of Responses from GenAI  

Wednesday, June 12, 2024

Note: The shared blogging with Melissa Warr and Nicole Oster continues, and this time we also have Margarita Pivovarova joining the team. I (Punya) wrote the first draft, which was then edited and polished by the rest of the team.

Do I contradict myself?
Very well then I contradict myself,
(I am large, I contain multitudes.)

~ Walt Whitman 

Given the somewhat amazing capabilities of generative AI, it is not just an academic exercise to ask which is superior, the computer or the human mind, and along which dimensions. The historian Jacques Barzun argued that the human mind was superior because computers were predictable: they generate consistently identical outputs from identical inputs. Of the human mind, he argued, “Its variability is its superiority.” At the same time, one could argue that this very lack of variability, in contrast to the fickle and unpredictable behaviors of humans, is the computer’s greatest strength. In fact, the very question of variability in output from computers seems somewhat absurd.

How true is that in today’s world of LLMs? 

In a recent set of posts, we have explored the kinds of biases that large language models (LLMs) can generate. These biases, of course, give us pause as we consider the use of these models in educational contexts. Here are some of the posts in this series:

  1. GenAI is Racist. Period. 
  2. Racist or just biased? It’s complicated
  3. Implicit Bias in AI systems 

What we have not discussed as much is the variability in the outputs of these LLMs. 

What we know is that these LLMs fabricate information, something that is inherent in their very nature, as we have argued in a couple of previous posts: It HAS to hallucinate and Why are we surprised? Given this fact, the question becomes: how variable or consistent are the outputs of these LLMs? When fed the same prompts, are their outputs somewhat in the same ballpark, or do they vary widely? And how would we even measure this? And does this even matter?

We were reminded of this while reading a recent post on LinkedIn from Robert Brems, Director of Strategy and Innovation at the ASU Police Department. In his post (titled Clery Compliance with AI: Revolutionizing Campus Safety or Risking Catastrophe?), Robert describes some fascinating work he has been doing exploring the use of LLMs in the domain of law enforcement.

In this post he described giving a “typical” scenario to a GPT trained on the Clery Act (we use “typical” in quotes because this is not a domain we can claim to have any deep or even superficial knowledge of).

(It’s also important to note the effort Robert took to ensure privacy and security while conducting his experiments, such as using ASU’s enterprise version of GPT, anonymizing data, and so on—more details in his post.)

Unsurprisingly, the responses indicated some bias, sometimes even blaming the victim: all matters of grave concern, though not unexpected given what we have written before.

But then this section popped out:

“To further investigate the AI’s consistency, I ran the same data through the GPT around twenty-five times, and the results were eye-opening. The AI generated multiple differing recommendations, sometimes suggesting an advisory, while other times remaining silent. In one particularly alarming instance, it even recommended sending an advisory for a crime that never actually occurred!”
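To make concrete what “running the same data through the GPT around twenty-five times” looks like in practice, here is a minimal, illustrative sketch. The model name, prompt, and incident text are placeholders, not Robert’s actual setup, and the tallying logic is only a rough way to see how often the recommendation itself flips across identical runs.

```python
# Illustrative sketch only: send an identical prompt to a chat model many
# times and collect the responses to see how much they vary. The model name,
# prompt, and run count are placeholders, not the original experiment.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Given the incident report below, should a timely-warning advisory be "
    "issued? Answer YES or NO, then explain.\n\n<incident report text here>"
)

responses = []
for _ in range(25):  # same input, 25 independent runs
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    responses.append(completion.choices[0].message.content.strip())

# Tally the first word (YES/NO) of each response: identical inputs,
# possibly different recommendations.
verdicts = Counter(r.split()[0].upper().strip(".,:") for r in responses)
print(verdicts)
```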

This variability is something we have observed as well, most recently in a study by Melissa Warr, Nicole Oster, Margarita Pivovarova, and myself. In this study, we documented experimental evidence of racial bias in ChatGPT’s evaluation of student writing. By manipulating racial descriptors in prompts, we assessed differences in scores given by two ChatGPT models. Our findings indicate that describing students as Black or White led to significantly higher scores compared to race-neutral or Hispanic descriptors. Again, given what we have been writing about (see links above), this is not particularly surprising.

What was surprising was the variability in responses. In some ways, by focusing on bias (which, don’t get us wrong, exists and is deeply troubling) we may have missed a bigger story. Robert points qualitatively to the variations the system generated. In our study we could actually put some numbers to it.

First, one of the major, statistically significant sources of variance was the difference in responses between the two versions of the LLM (ChatGPT 3.5 and 4). This suggests that the underlying model architecture or training data might influence the biases and outputs, highlighting the need for careful selection and testing of the models we use. Needless to add, these models are swapped out and changed with little notification, if any, and with little information on how they differ from each other.

Second, and more importantly, we were surprised to see that the explanatory power of the model remained quite low, even when all the variables were included. It is hard to describe how weird this is. Essentially, we had systematically modified a finite set of variables (such as the difficulty level of the essay, the version of the LLM, race, prompt order, etc.), which would suggest that most of the observed variation in the output would be caused by these changes. Clearly that was not the case.

What we found is that fully two-thirds of the variation in scores COULD NOT be explained. That is a massive amount of unexplained variance. In practical terms, the scores (out of 100) randomly varied by about 6 points on average.
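For readers who want to see the logic of that decomposition, here is a toy sketch with synthetic, made-up numbers (not our study data or analysis code): simulate scores from a handful of manipulated factors plus a large noise term, fit a linear model on those factors, and check how much variance is left unexplained.

```python
# Toy illustration of the variance-decomposition logic, not our actual data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "version":      rng.choice(["gpt35", "gpt4"], n),
    "race_label":   rng.choice(["none", "Black", "White", "Hispanic"], n),
    "difficulty":   rng.choice(["low", "high"], n),
    "prompt_order": rng.integers(1, 5, n),
})

# Small systematic effects plus a large "internal noise" term (illustrative numbers).
df["score"] = (
    80
    + np.where(df["version"] == "gpt4", 2.0, 0.0)
    + np.where(df["race_label"].isin(["Black", "White"]), 1.5, 0.0)
    + np.where(df["difficulty"] == "high", -3.0, 0.0)
    + rng.normal(0, 6, n)   # run-to-run variation the factors cannot explain
)

fit = smf.ols(
    "score ~ C(version) + C(race_label) + C(difficulty) + C(prompt_order)",
    data=df,
).fit()
print(f"R^2 = {fit.rsquared:.2f}  ->  unexplained share = {1 - fit.rsquared:.2f}")
```

In a setup like this, even though every manipulated factor is in the model, most of the variance sits in the noise term, which is the shape of the result we found.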

Now if we had run this experiment with humans (i.e., asking humans to evaluate student writing), we would definitely see variation in the responses, above and beyond that caused by the variables we had imposed. This happens in every study, except that in the human evaluation situation we would attribute this unexplained variation to contextual factors such as background knowledge, mood at the moment, whether they had had breakfast, and so on. Such variation is part and parcel of doing research with humans.

None of these factors mean anything in the LLM context. The models are the same. The data was generated within the model, i.e., we experimentally created the variation, which should then be almost entirely explained by the model. We can expect small deviations from the average, but definitely not as much as we found. In other words, there are NO unobserved factors, at least none that we know of, that could have influenced the score. We have no choice but to attribute this to some internal “noise” in the generative process.

This variability is baked into the system. 

(Those interested in the design of the study and our analysis can read the full article: Warr, Pivovarova, Mishra, & Oster, 2024.)

So what does all this mean? 

First, it tells us something about these models. As we have argued elsewhere, these LLMs have no choice but to hallucinate. That is how they operate. This flies in the face of how we have typically engaged with computers and software in the past. Our view has been that outputs are quite closely correlated to inputs, since computers blindly follow their algorithms in coming up with their solutions. This is clearly not the case with LLMs. Hence, our view of computers as being algorithmic is severely limited when applied to these large language models. Interacting with them is more akin to having a dialogue with a psychological other that needs to be orchestrated. (As we have written previously in a blog post titled Metaphors, Minds, Technology & Learning, in some strange way this technology forces anthropomorphic metaphors upon us.)

Second, just how much does prompting matter in reducing variability? There are strategies that can help, such as providing clear examples (one-shot or few-shot prompting, and so on), as sketched below. That said, we are somewhat skeptical (as we have written previously: ChatGPT does not have a user manual. Let’s not create one).
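As a concrete illustration of one such strategy, here is a minimal few-shot scoring sketch. The rubric, example essays, and model name are invented for illustration; nothing here is a recipe that guarantees consistent scores.

```python
# Minimal sketch of few-shot prompting for essay scoring; all content is
# hypothetical and only illustrates the structure of such a prompt.
from openai import OpenAI

client = OpenAI()

few_shot_messages = [
    {"role": "system", "content": "You score student essays from 0-100 using "
                                  "the rubric provided. Respond with the number only."},
    # Worked examples that show the expected format and calibration.
    {"role": "user", "content": "Rubric: clarity, evidence, organization.\nEssay: <strong sample essay>"},
    {"role": "assistant", "content": "92"},
    {"role": "user", "content": "Rubric: clarity, evidence, organization.\nEssay: <weak sample essay>"},
    {"role": "assistant", "content": "58"},
    # The essay we actually want scored.
    {"role": "user", "content": "Rubric: clarity, evidence, organization.\nEssay: <new student essay>"},
]

completion = client.chat.completions.create(
    model="gpt-4o",   # placeholder model name
    messages=few_shot_messages,
    temperature=0,    # lower sampling temperature; reduces but does not remove variability
)
print(completion.choices[0].message.content)
```

Even with worked examples and the sampling temperature turned down, repeated runs can still return different scores, which is consistent with the variability described above.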

Third, one solution that has been offered is that increasing the context window would generate more consistent results. However, it is not clear that this is truly the case, though to be fair the jury is still out on that.

Fourth, given the differences in responses across versions of GPT and the opacity surrounding which models ed-tech companies use, it becomes harder to develop any guidelines. Moreover, since these “black box” models are updated continually, with little or no forewarning, any guidelines we may seek to create for the use of these tools in educational contexts become even harder to pin down.

Fifth, we need not see this variability as always being a bad thing. In fact, it can be the basis for creativity, since developing variations has always been a key part of the creative process. In addition, for researchers digging into these technologies to understand them better, variability may be a key step forward. As Robert Shiller wrote:

… I find great excitement in discovering the complexity and variability of the world we live in, getting a glimpse into the deeper reality that we mostly ignore in our everyday human activities.

~ Robert J. Shiller

Finally, this concern raises significant questions about consistency in grading if these tools were to be used in real evaluation situations. This becomes more salient as we consider the news that the state of Texas recently announced that it would use automated grading to evaluate students’ open-ended responses, and OpenAI’s recommendation that ChatGPT Edu could be used to grade student work. We end with how Robert ended his post. We couldn’t say it better, except we would add education to the end of the sentence:

“These inconsistencies underscore the critical importance of human judgment and the potential risks of over-relying on AI in domains like public safety and regulatory compliance.”
