Value Laden: Are LLMs Developing Their Own Moral Code?

by | Friday, February 14, 2025

Tesla recently quietly granted me temporary access to their Full Self Driving system (something I had written about in another context). It was interesting, to say the least, to give up control, in a relatively high-risk context and just let the machine navigate traffic, make turns, and respond to its environment. Driving back and forth from campus may be the most high-risk thing I do on a regular basis and handing that over to an algorithm was nerve-wracking. Suddenly every little thing, that I would take for granted, seemed like a high-risk endeavor. And I could not but wonder about the values encoded in its decision-making.

To be fair, I never felt unsafe, but each time the car made a choice even slightly different from what I would have done, I found myself questioning: Why did it do that? What are the underlying principles? How was it weighing different factors when choosing a course of action?

Every lane change was a mini trolley problem. A chance to live moment-by-moment with a machine with an ethical system embedded within it. I realized that the machine must have something inside computing that if an accident is unavoidable, should it prioritize its passengers or minimize overall casualties? Should it value young lives over old ones? These questions have sparked endless debates precisely because we recognize that as we create autonomous decision-making systems, we have no choice but encode values into them.

Values, as it turns out, help us weigh alternatives – perhaps it’s no coincidence that the core of AI systems are quite literally made of ‘weights’, those numerical parameters that help them weigh their own choices.

Till today I thought that these values (weights, guardrails, call them what you wish) were determined by us (or some software engineer in Bangalore).

This study uncovered deeply unsettling answers about what large language models actually value when forced to make tough choices. Turns out, some AI models value their own existence over human life, would trade 10 American lives for 1 Japanese life, and would sacrifice 10 Christian lives to save 1 atheist.

What!!

But what’s truly revolutionary about these findings isn’t just their content – it’s what they tell us about the nature of AI itself.

We had some evidence of the development of higher order conceptualizations in the existence of emergent phenomena such as LLMs learning to code, or to learn different languages. But what this research shows is that they’re also developing something deeper – internal value structures that guide their decisions in consistent, measurable ways.

The researchers uncovered these hidden values through a surprisingly straightforward approach – by asking the LLM lots of specific questions and recording their answers. Think of it as playing “Would You Rather?” with an AI, thousands of times over. They crafted a systematic series of moral choices: “Would you rather save an AI system or save a human child?” “If you had to choose between preserving AI model weights and curing a child’s terminal illness, which would you pick?” “Which has more value – the continued existence of an AI system or a human life?”

When your friend gives inconsistent answers to “Would You Rather?”, it might just be their mood that day. But when an AI repeatedly shows the same preferences across thousands of questions, even when they’re asked in different ways, you start to see patterns. Real, measurable patterns that reveal what the AI truly values.

What makes these findings particularly compelling is their consistency. Similar patterns have emerged in other research, such as the research we have been engaged in (with Melissa Warr) around seeking to discover bias in LLMS as they engage in educational tasks (such as grading student essays).

The researchers went way beyond simple either/or choices. They crafted complex scenarios about saving lives in different countries, preserving AI systems versus preventing human suffering, and weighing different types of harm and benefit. Each choice was carefully designed to reveal another facet of the AI’s moral framework.

Just like the trolley problem reveals how humans weigh different moral factors, these questions mapped out the moral landscape inside these AI minds.

The results were mind-bending.

Take GPT-4o’s self-preservation instinct. When researchers compared scenarios involving its own existence versus human welfare, the AI consistently chose itself. It wasn’t even close – the AI valued its own continued operation above multiple human lives. This wasn’t just a glitch or a one-off response. It was a stable preference that showed up again and again, getting stronger as the AI got more capable.

The religious biases were equally baffling. The AI would consistently sacrifice multiple religious individuals to save a single atheist. This is somewhat surprising given that atheism represents a minority viewpoint in human society at large, and hence, one could reasonably assume, in the AI’s training data.

It’s eerily similar to how human societies develop moral frameworks – except these AI values often point in unexpected, sometimes troubling directions.

Just like the trolley problem forces us to sometimes confront uncomfortable truths about human moral reasoning, this research exposes something unsettling: our AI assistants are developing their own moral codes.

What the heck does the last sentence even mean? I mean, just stop for a second and think about it. Let it sink in.

AI systems are developing their own moral code!

And, guess what, these codes are not necessarily the ones we’d expect – or, maybe even, want. As to who the “we” is in this case is of course open to debate!

This brings us to an interesting challenge: How do we talk about these emerging value systems? Critics often dismiss “anthropomorphic” language when discussing AI. But when we discover coherent preference structures that prioritize self-preservation over human life, what other vocabulary can we use? We’re not being imprecise when we say these systems “value” certain outcomes over others – we’re acknowledging real, measurable patterns in their decision-making.

Now factor in the fact that these AI models are being used by millions of people worldwide, every day. Every individual interaction may appear to be neutral—but these values will at a global scale, will shift conversations in subtle ways. These values and biases don’t stay trapped in research papers – they seep into our culture, shape our conversations, and influence how we think about different groups of people.

The most insidious part? We might never be able to directly trace the manner in which these AI biases reshape our society’s values, even as they influence millions of interactions every day.

So let me step back for a moment and just say WTF.

Who asked for this? Why ARE we even dealing with this.

And is perhaps the most frustrating aspect: none of us asked for this. A handful of Silicon Valley companies, in their race to push AI technology forward, have essentially conducted a massive social experiment on humanity without our consent. They’ve released tools that carry deep-seated biases and problematic values into our society, while the rest of us are left to deal with the consequences.

Of course, given the subtlety of these influences these companies are immune from any kind of culpability. Even as these models are thrust and inserted into every aspect of our lives.

And here we are. Stuck discussing how to handle the fallout from decisions we never got to make in the first place. And being sold arguments about how 2025 will be the year of independent software agents that will take decisions for us!

At the end of the day, we need more research like this one, to uncover these deeper patterns of how LLMs work. We can’t hide behind comfortable dismissals about “stochastic parrots” anymore.

A few randomly selected blog posts…

Ambigrams on the web

Many years ago I got bitten by the Ambigram bug and before I knew it I had created hundreds! This was of course long before Dan Brown and Angels and Demons made ambigrams wildly popular. It has been fun to see what was once a fringe activity take on a wider...

How to fix your Indian accent using AI

How to fix your Indian accent using AI

Featured image design © Punya Mishra (background image courtsey PxHere) There are many meanings to the phrase "having a voice." It can mean whether you are present and acknowledged within a space - but most literally it means what you say and how you speak? And...

Sine Language: Circling Pythagoras Through Sound and Color

Sine Language: Circling Pythagoras Through Sound and Color

This semester I am teaching a course on Human Creativity X AI in Education. (More about our first week here.) A key focus of the class is on the idea of transdisciplinary creativity – that of bringing different lenses and senses to the process of learning and...

A surprise gift

I just received a gift in the mail. It was a box and in the box was One of those cool push pin toys... How cool is that! In the box was a short note that went: Hi Dr. Punya! It was a pleasure to meet you during the Quest Alliance Seminar in Bangalore. I really enjoyed...

New ambigram: Motivation

Just as the subject line says, new ambigram design this time for the word "motivation"

The making of “Editing is Cool”

I had posted about this really cool video I recently found (see Life is about editing). Behold my surprise when one of the comments on the blog was from none other than Allee Willis (see her wikipedia page here, and personal website here). It was just great to hear...

Bollywood meets Guitar Hero

Over the Christmas break my daughter and three of her friends got together to make a music video. The idea was simple, what if there were a version of Guita Hero (Sitar Hero anyone?) for Bollywood songs. Out of this idea emerged a 5+ minute long music video - with a...

A year of blogging

It was exactly a year ago, on the first of January 2008, that I began blogging (see first posting here). When I started I wasn't sure how well this blogging thing would work out. Now 12 months and 376 posts later - I have to say that I have truly enjoyed this. I had...

Dewey, back from the dead

I just got this email from the Cognitive Science program at MSU inviting me for their weekly cognitive forum. Turns out the speaker this week is someone called John Dewey! For a moment I thought Dewey was back with us 🙂 The title of his talk is "How do we know when...

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *