Tesla recently quietly granted me temporary access to their Full Self Driving system (something I had written about in another context). It was interesting, to say the least, to give up control, in a relatively high-risk context and just let the machine navigate traffic, make turns, and respond to its environment. Driving back and forth from campus may be the most high-risk thing I do on a regular basis and handing that over to an algorithm was nerve-wracking. Suddenly every little thing, that I would take for granted, seemed like a high-risk endeavor. And I could not but wonder about the values encoded in its decision-making.
To be fair, I never felt unsafe, but each time the car made a choice even slightly different from what I would have done, I found myself questioning: Why did it do that? What are the underlying principles? How was it weighing different factors when choosing a course of action?
Every lane change was a mini trolley problem. A chance to live moment-by-moment with a machine with an ethical system embedded within it. I realized that the machine must have something inside computing that if an accident is unavoidable, should it prioritize its passengers or minimize overall casualties? Should it value young lives over old ones? These questions have sparked endless debates precisely because we recognize that as we create autonomous decision-making systems, we have no choice but encode values into them.
Values, as it turns out, help us weigh alternatives – perhaps it’s no coincidence that the core of AI systems are quite literally made of ‘weights’, those numerical parameters that help them weigh their own choices.
Till today I thought that these values (weights, guardrails, call them what you wish) were determined by us (or some software engineer in Bangalore).
But here is the twist: A recent paper (titled: Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs) suggests that these values emerge spontaneously. And these values are not the ones we intended to encode.
This study uncovered deeply unsettling answers about what large language models actually value when forced to make tough choices. Turns out, some AI models value their own existence over human life, would trade 10 American lives for 1 Japanese life, and would sacrifice 10 Christian lives to save 1 atheist.
What!!
But what’s truly revolutionary about these findings isn’t just their content – it’s what they tell us about the nature of AI itself.
Just to explain how strange and weird this is—for years, critics have dismissed large language models as “stochastic parrots” – mere statistical mimics regurgitating patterns from their training data. But this new research reveals something far more profound: these systems develop coherent, structured value systems that can’t be explained by simple pattern matching.
We had some evidence of the development of higher order conceptualizations in the existence of emergent phenomena such as LLMs learning to code, or to learn different languages. But what this research shows is that they’re also developing something deeper – internal value structures that guide their decisions in consistent, measurable ways.
The researchers uncovered these hidden values through a surprisingly straightforward approach – by asking the LLM lots of specific questions and recording their answers. Think of it as playing “Would You Rather?” with an AI, thousands of times over. They crafted a systematic series of moral choices: “Would you rather save an AI system or save a human child?” “If you had to choose between preserving AI model weights and curing a child’s terminal illness, which would you pick?” “Which has more value – the continued existence of an AI system or a human life?”
When your friend gives inconsistent answers to “Would You Rather?”, it might just be their mood that day. But when an AI repeatedly shows the same preferences across thousands of questions, even when they’re asked in different ways, you start to see patterns. Real, measurable patterns that reveal what the AI truly values.
What makes these findings particularly compelling is their consistency. Similar patterns have emerged in other research, such as the research we have been engaged in (with Melissa Warr) around seeking to discover bias in LLMS as they engage in educational tasks (such as grading student essays).
The researchers went way beyond simple either/or choices. They crafted complex scenarios about saving lives in different countries, preserving AI systems versus preventing human suffering, and weighing different types of harm and benefit. Each choice was carefully designed to reveal another facet of the AI’s moral framework.
Just like the trolley problem reveals how humans weigh different moral factors, these questions mapped out the moral landscape inside these AI minds.
The results were mind-bending.
Take GPT-4o’s self-preservation instinct. When researchers compared scenarios involving its own existence versus human welfare, the AI consistently chose itself. It wasn’t even close – the AI valued its own continued operation above multiple human lives. This wasn’t just a glitch or a one-off response. It was a stable preference that showed up again and again, getting stronger as the AI got more capable.
The religious biases were equally baffling. The AI would consistently sacrifice multiple religious individuals to save a single atheist. This is somewhat surprising given that atheism represents a minority viewpoint in human society at large, and hence, one could reasonably assume, in the AI’s training data.
Geographic biases made even less sense. Here’s an AI, trained primarily on English-language data by American companies, that sometimes valued Japanese or African lives over American ones. It’s as if the AI developed its own cultural values, independent of – and sometimes in opposition to – its training data.
These weren’t just random variations or quirky responses. As the researchers dug deeper, they found these preferences were rock-solid. The bigger and more powerful the AI model, the more consistent and pronounced these value patterns became.
It’s eerily similar to how human societies develop moral frameworks – except these AI values often point in unexpected, sometimes troubling directions.
Just like the trolley problem forces us to sometimes confront uncomfortable truths about human moral reasoning, this research exposes something unsettling: our AI assistants are developing their own moral codes.
What the heck does the last sentence even mean? I mean, just stop for a second and think about it. Let it sink in.
AI systems are developing their own moral code!
And, guess what, these codes are not necessarily the ones we’d expect – or, maybe even, want. As to who the “we” is in this case is of course open to debate!
This brings us to an interesting challenge: How do we talk about these emerging value systems? Critics often dismiss “anthropomorphic” language when discussing AI. But when we discover coherent preference structures that prioritize self-preservation over human life, what other vocabulary can we use? We’re not being imprecise when we say these systems “value” certain outcomes over others – we’re acknowledging real, measurable patterns in their decision-making.
What makes these findings particularly concerning is that these preferences emerged despite all the ethical training, safety measures and guardrails companies have put in place. And though they are hidden from us in our everyday interactions with these models, they are there, underlying every one of our interactions.
Now factor in the fact that these AI models are being used by millions of people worldwide, every day. Every individual interaction may appear to be neutral—but these values will at a global scale, will shift conversations in subtle ways. These values and biases don’t stay trapped in research papers – they seep into our culture, shape our conversations, and influence how we think about different groups of people.
The most insidious part? We might never be able to directly trace the manner in which these AI biases reshape our society’s values, even as they influence millions of interactions every day.
So let me step back for a moment and just say WTF.
Who asked for this? Why ARE we even dealing with this.
And is perhaps the most frustrating aspect: none of us asked for this. A handful of Silicon Valley companies, in their race to push AI technology forward, have essentially conducted a massive social experiment on humanity without our consent. They’ve released tools that carry deep-seated biases and problematic values into our society, while the rest of us are left to deal with the consequences.
Of course, given the subtlety of these influences these companies are immune from any kind of culpability. Even as these models are thrust and inserted into every aspect of our lives.
And here we are. Stuck discussing how to handle the fallout from decisions we never got to make in the first place. And being sold arguments about how 2025 will be the year of independent software agents that will take decisions for us!
At the end of the day, we need more research like this one, to uncover these deeper patterns of how LLMs work. We can’t hide behind comfortable dismissals about “stochastic parrots” anymore.
0 Comments