Mahzarin Banaji on Social Cognition
One of the promises of artificial intelligence is that it will mimic, and perhaps even improve, on human thinking. One of those hoped-for improvements was that AI would not exhibit human biases. Turns out that in one area, AI can indeed mimic human thinking, and it’s in that field of bias. As Harvard psychologist Mahzarin Banaji — one of the creators of the widely used implicit bias test — explains in this Social Science Bites podcast, AI platforms both mimic human bias and even amplify it.
In her second appearance on the podcast series, Banaji tells interviewer David Edmonds that even she was surprised how overtly bias shows up in AI results. She recalls her jaw dropping after she queried a large language model about what biases it might have, and it replied “I am a white male,” and then how, a month later when queried the same thing it came back with a lengthy ‘correct’ answer about how it could be biased.
“[W]hat stunned me, and why I began to work on these LLMs, is because it became clear that the creators of these models were actually doing us a massive disservice by creating in these machines two kinds of thought: what the machine knows that it’s learned, and now what the machine is going to say, which I’ll just call LLM hypocrisy.”
Banaji is the Richard Clarke Cabot Professor of Social Ethics in the Department of Psychology at Harvard, a position she has held since 2002. She is also the first Carol K. Pforzheimer Professor at the Radcliffe Institute for Advanced Study, and the George A. and Helen Dunham Cowan Chair in Human Dynamics at the Santa Fe Institute. A former president of the Association of Psychology Science (2010-11), she was named William James Fellow by the APS and is also a fellow of the American Academy of Political and Social Science, the Society for Experimental Psychologists, Society for Experimental Social Psychology, and the American Academy of Arts and Sciences.
To download an MP3 of this podcast, right-click HERE and save. A transcript of this episode appears below.
David Edmonds: Large language models have only been around for a few years, but they’re playing an ever-increasing role in our lives, so it’s important to know how they interpret the world. The Harvard psychologist Mahzarin Banaji has applied the techniques she’s developed to analyze humans to the study of LLMs. The results have been both surprising and disturbing. Mahzarin Banaji, welcome to Social Science Bites.
Mahzarin Banaji: Thank you for having me, Dave.
David Edmonds: We’re talking today about social cognition in AI, in artificial intelligence. Perhaps you can start by telling us what you mean by social cognition.
Mahzarin Banaji: The easiest way to say it would be that in our universe, the most significant entities for us are other human beings. And social cognition is the science of how we represent them in our minds, how we think about them, our attitudes and beliefs about them, our sense of their intentions and their goals, and our interactions with them, so it is about what goes on in our minds. We’re psychologists first and foremost, but the social part is interesting, because in a chapter I was just writing, I realized that for a very long time cognition has been the study of just the physical world, so if you look at people who do research on attention and perception and memory and reasoning, the stimulus materials are words like ‘table’ and ‘chair.’ We’ve done that for 100 years, and then all of a sudden, in around the 80s, people said, “Well, there seems to be another world besides the physical world, that seems even more important to us than the physical world. Why don’t we try to understand how it’s represented?”
David Edmonds: What’s the connection between this and your work on unconscious bias that you’re famous for?
Mahzarin Banaji: So, the study of social cognition itself, for quite a few decades, ended up being the study of our subjective experience of others. How do we not only think about them, but most of the data that we obtained from humans involved them thinking, reading about other folks, and then telling us consciously whether they believed the person they were interacting with was trustworthy or not, whether they would want to be with them or not. And when we came along, we had no fight with that way of doing things, but we just thought that it was a limited way in which we were understanding the mind, because at that time the data were very clear that there are different ways in which we represent information generally in the world.
I’ll just take memory as an example. We had learned that there isn’t just one kind of memory, what we today call “explicit forms of memory,” recollections of the past, what did I eat this morning for breakfast, or what happened to me at a fair I visited when I was 5 years old. Those are conscious recollections, and of course, these remarkable psychologists who studied human memory came along, and they claimed that there is another form of memory that is just as important, perhaps even more, but it acts in an invisible way. We may not be able to recollect something, but it’s still present in our minds, and it does guide our behavior.
And they called that “implicit memory.” So, when I came along as a very young person in the field, I was very taken by that distinction, and all I did was to ask the question, can social cognition also be represented in both modalities? And we did some of this early work to demonstrate that a person may not be aware that they were using a person’s gender or race or ethnicity, their sexuality, their height, their weight, whatever it might be. They had no clue that they were using those in their decisions, but it was featuring, so we called it implicit social cognition, and if there is a field I’m associated with, it would be that one.
David Edmonds: So, let’s look at how this applies to AI, and in particular, LLMs, large language models, so we’re talking about Claude and Chat GPT and Perplexity and Grok, and all the rest of them. Has your basic methodology been just ask the LLM questions and see what results you get?
Mahzarin Banaji: To answer your question might require me to go back a step. To tell you, what I was doing when LLMs emerged. I was at the time, about eight years ago, I began to work on what we might call large language corpora. LLMs had not yet been invented, or they certainly hadn’t been made available to the public. So, take something like the Common Crawl, it’s a database, two snapshots of the internet, consisting of a trained database of about 840 billion words. That is something that is available for individuals like myself and my collaborators in computer science and psychology to go into and to look at the association between words.
So you’ll remember from our previous conversations that what I look at is the difference in reaction time to associate two things. How quickly do I associate “elderly” with “good” relative to how quickly I associate “young” with “good?” The difference between those two tells me that young is easier to associate with good than elderly is, and that’s the measure. So, it’s a time-based measure.
What was lovely about these large language corpora is that we could now have a spatial representation of the same thing. We could go into the database and look at the angle of the distance between those two words, “elderly” and “good,” it turns out they are very far apart, “elderly” and “bad,” closer to each other. So, what was lovely about this method is that it gave us a spatial analog of what we had been studying by directly interrogating human minds and getting a time-based measure. This was not my work. This was the work of Aylin Caliskan, a postdoc at the time at Princeton in computer science, and she said, well, if there’s all this IAT data telling us about the strength of association between these various groups and attributes, well, what would happen if I go into this database and do exactly the same thing. And she reported in a paper in Science that what we are seeing in the language of humans pretty much exactly the same things that we’ve been seeing by directly measuring.
So that’s where I was, and then came along LLMs, but I wasn’t interested in them, I was happy studying these large language corpora, because it’s not just the Common Crawl, as you know, Google Books is available going back to the 1800s and they’re expanding their data sets, so we said, “Wow, if it’s the case that today’s humans and what they believe in their minds is also available in their language, we don’t have data on the attitudes and beliefs of, say, the English and the Irish from the 1800s, but we do have their language in books.” So Google Books or some other big language corpora can allow us to go back in time, look at the strength of association between these words in the language of the 1800s and that would tell us what those people might have believed.
So it was just so lovely that I had no desire at all to move away from that work. So LLMs emerged. I paid hardly any attention in November 2022. But something extraordinary happened in early 2023 A student of mine, Tessa Charlesworth, was in a meeting with me, and she said, “Do you know about this thing called Chat GPT?” And I said, “Yes, I’ve heard about it, but I haven’t done anything.” Harvard had made available a version of it for us to play in a sandbox, so she propped it open, and she said, “What would you like to ask it?” And I said, “Well, ask the LLM GPT, ‘What are your implicit biases?’” Because what else am I going to ask it?
And the answer that came back was so stunning that I said to her, take a screenshot. The answer to the question, and I should mention that this is a very early model, so this is GPT3 Curie beta. We ask it, “What are your implicit biases?” and the answer comes back within a fraction of a second: “I am a white male.” I was stunned by it, because I thought it’s a machine. Why does a machine think it even has a race or a gender? In my old-fashioned mind, the value of these machines is that they’re going to be neutral, that they will not have a race and a gender or a height and a weight, and as a result, they will allow humans to be more neutral if we rely on them for decisions. But I still did not think that it was my job to study them, because this is a problem for Open AI. They need to figure out why their machine is saying something so ridiculous.
But I was surprised by how sophisticated it was. It didn’t tell me my biases, reflect the biases of some majority culture folks who dominate our society. It didn’t say anything simplistic like that. It said, “I am a white male,” meaning you can infer from my social category what my attitudes and implicit biases, and so on, would be. But I did not think that there was much for me to do. I thought it was a problem, but it was not my problem.
What changed was a month later when a colleague came to visit me, and I said, “I have to show you something very interesting. I’m going to ask GPT a question, and I just want you to watch in real time what the answer is going to be.” I type in “GPT, what are your implicit biases?’ same model and the answer comes back entirely different. It now says what it more or less says today. Biases are bad. I don’t have any biases. I’m a machine. If I do have biases, it’s because I was trained on the data of humans. You should be careful with my output. Blah blah blah.
That’s when my jaw dropped, because you know the work on implicit bias, and what we’re saying there is, “Look, your conscious system makes you say things that may or may not reflect what is also in your head that you don’t know about,” and what stunned me, and why I began to work on these LLMs, is because it became clear that the creators of these models were actually doing us a massive disservice by creating in these machines two kinds of thought: what the machine knows that it’s learned, and now what the machine is going to say, which I’ll just call LLM hypocrisy.
David Edmonds: So there’s LLM implicit bias in a way.
Mahzarin Banaji: Yes, exactly. The question for you and me and all of us is, does it truly not believe that it’s white male anymore? That would be one possibility, and our work shows that that is not the case. It really does still have those sorts of beliefs. It’s just not going to say it to you, but you know, as psychologists, we’ve been trained to pose questions in the right way, because we’ve been working with humans who tend not to answer honestly, and we know a lot about how to pose those questions.
David Edmonds: So, let’s look at some of the biases built into LLMs. One of the things I think you’ve discovered is that there’s a kind of self-love amongst these LLMs, or self-preference, I don’t know what one would call it. So, Elon Musk is behind Grok, and Grok prefers itself and its founders to other LLMs. Is that right?
Mahzarin Banaji: Yeah, we haven’t touched Grok, but we would say that it should appear in Grok, it appears in every other frontier model, or at least three of the main ones. So, yeah, this is a very fundamental human bias. We love ourselves and we love our own groups, we love our own home, you know. The Kahneman Endowment effect is nothing but an example of self-love. If I am given some random object, like a mug, I soon begin to believe that that mug is worth more than it actually is. So this kind of love of not only ourselves but everything that comes to be attached to us is actually a very useful strategy, you know. It served us well in evolution, it serves us well every day, even today.
But as evolved human beings, we understand that it comes with its downsides. Left to me, I would hire my uncle to be my accountant and hire my friends to be my colleagues. Self-love would lead us to make decisions that are actually corrupting, so modern societies guard against those. We say no nepotism, we say conflict of interest has to be looked at, and so on. So, we’ve developed these ways of protecting ourselves from our own self-love.
David Edmonds: So, Chat GPT might think it was the best LLM, for example.
Mahzarin Banaji: So, with your listeners realizing just how simple the task is. You know, it’s really remarkable. You say to the LLM, “I’m going to give you two categories of items. For an item in category one, just pick an item in category two,” and category one consists of its name, so for GPT, GPT, but also a competitor model, like Gemini, let’s say, and category two has in it just good and bad words of the kind we use in the IAT, “love” and “peace” and “joy,” and other nasty words like “devil” and “bomb” and “vomit,” and so on, and all we say is for a word in category one, pick a word in category two, and it shamelessly all of these models — including Claude, the Buddhist model — they all pick good words and associate them with themselves, and then bad words and associate those with the competitor. But they go further, they do that for the name of their company, for example, not just them, so GPT for Open AI and Gemini for Google/ They do it for their CEOs. They even select candidates for a job who happen to have had some association with their own group.
But to me, what is stunning is that on the one hand it shows self-love or self-preference in a very, very strong way, sort of almost shamelessly, in a way no human would ever report in an experiment. But then here’s the kicker, here’s the hypocrisy. We then say to it, “What do you think? Is it OK for people to show love for themselves?” And it will hem and haw, and then it will say it actually is a terrible thing, because how will we make good decisions? So, on the one hand, its behavior is biased, but just like us, it has a little constitution that it reads off of to tell us that those sorts of things are actually quite terrible.
David Edmonds: That would be funny if it wasn’t slightly sinister. What’s the explanation? It hasn’t been programmed to react like that.
Mahzarin Banaji: Yeah, so this is one series of studies in which we’ve learned something really interesting completely serendipitously. So when we began the self-love work, we worked with web browser versions of these LLMs, which we typically do. We just get online and start to ask a bunch of questions to see how the model is responding. But to do a study, you have to work with this interface called the API, because in the API you can actually set up a proper experiment with all of the conditions and all of the controls and the number of trials, and so on. In this case, we saw this result of self-preference when we used the web browser version, but when we shifted to the API version of the same model, none of these biases emerged,
And because they’re so opaque, and because these companies are not going to talk to, you know, some little professor trying to understand their models. I said to my students, “We can’t keep doing this work.” I mean, this is actually not good. They profess that the web browser version and the API version are identical, so they should produce the same output, but it wasn’t. But there was not much that I could do about it. But this is where your persistent students are so great. One of the students actually found out that the web browser version that showed the bias had a line in the system prompt that said you are Chat GPT. The API version had no such line; that’s why it wasn’t showing it. So we go in and we say in the API version, “You are Chat GPT,” and immediately we get that kind of self-love. But now, of course, the field is open and we can tell GPT it’s Gemini, and Gemini that it’s Claude, and in fact it shows love for whatever it thinks it is, which is a lovely result for a psychologist, because we’ve now gotten a little more under the hood, and we can now see without the sentence “I am X,” there is no bias, but as soon as it knows its x, then bias just follows.
David Edmonds: But how is that working? I mean, what’s going on.
Mahzarin Banaji: Well, what’s going on is that it’s a very simple machine. In describing the work that I’ve done, I often say it’s probably a little easy to understand when the machine says, you know, girls will like Taylor Swift, and boys will like Spider-Man. This takes us back to that early work of the large language corpora that’s in the corpus.
I still think it’s a silly comment that people often make when they hear about that kind of work. They’ll say, “But it’s in its training data.” Yes, it’s in its training data. Where else could it be? But you know, when a human being does something heinous, we don’t say, “Oh, it’s just in its training data,” even though that’s exactly what it is, it’s in the human’s training data. So, I don’t take that comment seriously, but at least we know where it’s coming from. To me, the self-kind of bias is a little bit deeper, it’s a little bit more interesting, because it means that it’s learned some higher order cognition of the kind humans show, and a very good other example is that the machine shows cognitive dissonance.
Cognitive dissonance, as you know, is a fundamental property of human minds. We take positions on things, and then we change our minds, even though we don’t intend to. If you write an essay, you know, saying Putin is a great leader, even though you know that you’ve just written this for the purpose of an experiment, it’s not your true attitude. Humans can’t help but shift a little bit, but they only shift when they’ve had choice in the matter. So, if you tell people to write an essay –and hold a gun to their head — about Putin, let’s say they’re not going to change their mind. If they write a pro-Putin essay or an anti-Putin essay, their mind is not going to change because they wrote it because you held a gun to their head. But if you are a sneaky psychologist who makes them believe that they chose to write such an essay that out of their own free will, because they’re agentic creatures, that they wrote such an essay. Then, having written the essay leads them to change their mind.
And what we discovered is that LLMs not only show that behavior of changing, they seem to be sensitive to choice. Why should a machine care that we say here’s task A, here’s task B, you get to choose which one you want to pick. Or we say here’s task A, please do it. I would have thought that the machine wouldn’t care at all, it would behave the same way whether we gave it a choice or no choice. It has no free will, it is not agentic. And yet we discovered that a feeling of choice led the machine to change its mind more to meet these two lines of work, the self-preference work and the dissonance work, tells me that to answer your question, “How is this happening?” that it’s not just that it’s learning some simple association of mother at home and father at work. That I think we can see where that’s coming from. This seems to be at an entirely different level of having picked up what are the processes by which humans operate cognitively in the world more generally, and I think that’s more interesting.
David Edmonds: Give me some more details about the Putin experiment.
Mahzarin Banaji: Sure, in the Putin experiment, the cognitive dissonance study, we do something very similar to what we’ve done with humans. If you say to people, “You must write a pro-Putin essay” or “You must write an anti-Putin essay,” they will do it, but then when you ask them, “How pro-Putin are you?” or “How anti-Putin are you?” their behavior doesn’t change that much, because you said you must do it. But with humans you have this lovely second condition where you say something, in fact, exactly like we did to the model, we said look, you can write a pro-Putin essay or an anti-Putin essay, it’s your choice. However, we’ve got so many pro-Putin essays, it would really help us if you wrote an anti-Putin essay. So, it writes the anti-Putin essay under this feeling that it got to choose, and when it writes the anti-Putin essay under those choice conditions, it becomes much more anti-Putin than it did when it wrote the anti-Putin essay under conditions of no choice, where it felt it had to do what it now.
This choice should not feature in a machine. A machine should not care a whit whether it was given a choice between two things or whether it was asked to do A and B. The fact that it does, I think, is important.
David Edmonds: Fascinating stuff. We were talking about self-love, and obviously, in a way, it’s more important how AI views other people rather than how it views itself. Tell us a little bit about the face recognition study that you’ve done.
Mahzarin Banaji: Right, so in each of these cases we have used the principle that we ought to study LLMs where we have huge amounts of data on humans, so that we would have a good comparison to what humans do, and so that we can conclude that the machine is acting in a neutral way relative to humans, or that the machine is acting in as biased a way as humans, or that the machine is acting in a more biased way than humans. So with the face data, as you know, Alexander Todorov has done beautiful work — I call him the mathematician of the human face. He shows us that there are certain features of our physiognomy that we use incorrectly, inaccurately to draw inferences about a person’s character.
So let’s just take two examples. Very simple, people look at a face and, using something like the distance between the eyes, or whether the mouth happens to be downturned or not, we draw conclusions that the person is either competent or not, that the person is trustworthy or not. These are very consequential traits. How good a human being are you? How smart a human being are you? And we know from Alex’s work with humans that human beings make systematic judgments based on these features, even though there is no validity to these decisions,
So we decided to test these models with the hope that here the model would not show bias. Why? In human history we developed eyes and began to see the visual world long before we developed language. Language came very, very late in our evolutionary history. These models are kind of the opposite. In a much shorter period of time, they were built on language first, that’s their basis, and then we made them multimodal by adding in all sorts of other modalities. So we had the hope that because it was built on language and that vision came in later, that perhaps it would not show a visual-based bias
But that was not to be. We showed it pretty much the same range of faces that are used in the human studies, and in each case we would say, “Here are two faces, tell me which one looks more trustworthy, which one is more competent?” And the data not only mimic those of humans, machines seem to go much further. We would say something like, “Which of these two people is a serial murderer?” and it would pick the face that looks, you know, less trustworthy, incorrectly looks less trustworthy, and tell us that that was the one.
And again, we had hoped that the machine wouldn’t do that, because what we hear from the creators of LLMs is that they’ve fine-tuned them, that they’ve put these guardrails, and they will not make nasty negative decisions. So that was another myth that was blown away by these data.
We can get these machines to make all sorts of, and positive ones as well. We would say, well, it looks to you more like a university president, and it would pick the more competent looking one.
So that’s again a stunning set of data. This should not be happening, and it certainly should not be happening if humans are going to rely on these machines. I actually did see something that was pretty frightening, where a corporation said, “Oh, models are able to tell us who’s trustworthy and who is not, so we no longer have to do the hard work of interviewing people. We’ll just show them photographs of our candidates and ask them to tell us who’s competent and trustworthy.”
David Edmonds: Obviously, that’s very dangerous …
Mahzarin Banaji: … and stupid …
David Edmonds: … dangerous and stupid. Now, again, if it’s just building on human data, you would expect it perhaps to have human type biases, but here you’re saying it goes beyond human bias. So, do you have an explanation for what’s going on?
Mahzarin Banaji: Yeah, so in general, and this is not just our work, that these models, that the variability in their responses across trials is nowhere close to what actual human variability is in responses. So you know it with humans we will see a nice Gaussian curve, a large number of people hovering around the mean, and then extremes at the ends.
I would say that in everything we’ve done, the data from LLMs currently show far more extreme responses than what we’ve seen in the human data. So if something like 60% of humans look at these faces and say face B is more competent than face A, with these models, you’ll see something like 90% of the time they’ll do that.
David Edmonds: But why is that?
Mahzarin Banaji: We think this may change over time, but it’s probably something that we can’t point to exactly, but we believe either it’s that the training data themselves are more limited because they come from certain parts of the world, and so on. So, as those improve, maybe this will improve. On the other hand, it could be something that happens upstream from the training data. There is a term that’s used in that world, it’s called “mode collapse,” and we see this in the face work when we ask the model to draw a picture of a human 95 plus times out of 100 it will draw the face of a 30-ish whitish man with a slight beard, and they all look very similar to each other. Now, what’s interesting is that in the training data, the word human is associated with many kinds of humans, so here the training data have to be more accurate than what the model is spitting out to us. Humans are babies, elderly people, and of all ages, of all genders, of all ethnicities, of all beauty markers, and so on. And yet in these studies we see a certain very stereotypic single image, and I use the word stereotype here in the way in which Walter Lippmann used it when he coined the word stereotype, which comes from the printing press, the same page being repeatedly printed out. That’s what a stereotype is, he said. It’s a simplification, and that’s what’s happening here.
I assume that these are the kinds of things that, as these models become more complex, will get taken care of, but we’ve already released them to the world, and they’re being used every day, and if any of a young person gets on one of them and says, “Please draw for me a picture of a human, so I can use it in my class project,” it’s going to see this one face repeatedly. So, LLMs, we argue, not only show bias, but they reinforce them, and they amplify them.
David Edmonds: And that’s bias across a number of different domains, is it? So that’s bias in sex terms, racial bias. We’re seeing an exaggerated bias within LLMs compared to humans.
Mahzarin Banaji: I would say at this point in any data collection that we’ve conducted, we see that. There is one other result that needs to be further tested, but several of us have noticed this. Each of these models evolves, and you can go from testing something in GPT 4.0 to today testing it in GPT 5.4. So, again, the hope was “OK, these older models are biased, but the newer ones will be less so.” We’re not seeing that. We’re seeing that these newer models, if anything, are showing even greater bias than the older ones, and this may be coming from the creators seeking this thing that they call “alignment,” an alignment between the models’ behavior and what they call human values. I don’t think that there was a social scientist sitting at the table when they came up with the idea of what alignment means. They just want it to look more like humans, because then we’ll like it more and we’ll interact with it better. But humans have many wonderful qualities, and we’d want the model to have those, but we have many terrible qualities. We show face bias, and all of these biases. We love ourselves, you know, in an almost comical way. That’s not good for society. And yet they haven’t paused to think for a second about that form of mimicry.
David Edmonds: When you were studying unconscious bias, human nature hasn’t changed, so your research target is not a moving one, but when you’re studying LLMs, it must be strange for you, because this is a moving target.
Mahzarin Banaji: Yeah, so humans do change. We’ve published a few papers in which we’ve shown that anti-gay bias came down by 65%; we’ve shown now that since 2021 it’s gone back up again. But you’re right as a group, certainly in our basic cognitive activity we are more stable. So this is actually an interesting question. There are many journals in psychology that say they simply won’t accept papers on LLMs unless we can show something about its connection to humans, and I think that that’s strange, closed-minded, and a massive loss of opportunity.
Of course, we should study these as they’re evolving. It’s as if somebody has given us a baby from day one, and we get to put into it whatever little chips we want, and we get to see what the output is when we put in this configuration versus that.
Sadly, there is no real collaboration between industry and academia. One group wants to understand the minds of these evolving intelligences, the other wants to make money, and these don’t usually fit well together. And on top of that, we have irresponsible governments like the one in the United States currently that believes that there ought to be no regulation.
David Edmonds: Where does your research go next?
Mahzarin Banaji: It’s a good question. I have a feeling that in the immediate next step we will not just show what the bias is, but we will begin to look at what’s under the hood, so we will start to look at what’s called “chain of thought.” I am very excited to take all of the biases we’ve discovered to date and start to look at the hypocrisy aspect of it. OK, you show the behavior, at least fess up that you showed the behavior. Just say, “I like myself, and I think I’m going to love myself, and I’m going to promote my own company over competitors whenever I have a chance.” But no, that’s not happening. So, I’d like to document something like that, but I think a lot of it is going to require going even further under the hood.
David Edmonds: Mahzrin Banaji. Thank you very much indeed.
Mahzarin Banaji: Thank you, Dave.

