A Team Approach to Tackling the Psychology Replication Crisis

Replica Oscar statuettes in a souvenir shop. (Photo: Adarsh Upadhyay/Flickr)

In 2008, psychologists proposed that when humans are shown an unfamiliar face, they judge it on two main dimensions: trustworthiness and physical strength. These form the basis of first impressions, which may help people make important social decisions, from who to vote for to how long a prison sentence should be.

To date, the 2008 paper — written by Nikolaas Oosterhof of Dartmouth College and Alexander Todorov of Princeton University — has attracted more than a thousand citations, and several studies have obtained similar findings. But until now, the theory has only been replicated successfully in a handful of settings, making its findings biased towards nations that are Western, educated, industrialized, rich, and democratic — or WEIRD, a common acronym used in academic literature.

Now, one large-scale study suggests that although the 2008 theory may apply in many parts of the world, the overall picture remains complex. An early version was published at PsyArXiv Preprints on October 31. The study is under review at the journal Nature Human Behaviour.

This article by Dalmeet Singh Chawla was originally published by Undark and is reposted with permission. Undark is a non-profit, editorially independent digital magazine exploring the intersection of science and society. It is published with generous funding from the John S. and James L. Knight Foundation, through its Knight Science Journalism Fellowship Program in Cambridge, Massachusetts.

The study is the first conducted through the Psychological Science Accelerator, a global network of more than 500 labs in more than 70 countries. The accelerator, which launched in 2017, aims to redo older psychology experiments, but on a mass scale and in several different settings. The effort is one of many targeting a problem that has plagued the discipline for years: the inability of psychologists to get consistent results across similar experiments, or the lack of reproducibility.

The accelerator’s founder, Christopher Chartier, a psychologist at Ashland University in Ohio, modeled the project in part on physics experiments, which often have large international teams to help answer the big questions. The first study going through Chartier’s accelerator included just shy of 11,500 participants from 41 different countries. Each participant rated 120 photos of racially and ethnically diverse faces on one of 13 traits such as trustworthiness, aggressiveness, meanness, intelligence, and attractiveness.

Worldwide, Chartier and colleagues generally find strong support for Oosterhof and Todorov’s original theory that valence, an indicator of trustworthiness, and dominance, a measure of one’s physical strength, drive the majority of people’s snap judgements.

But in all regions except Africa and South America, a third factor related to happiness and “weirdness,” or how strange or bizarre a person appears to be, also influenced how participants judged faces, says Lisa DeBruine, a psychologist at the University of Glasgow and a lead author on the new study.

Previous replication research by DeBruine and colleagues has also shown support for Oosterhof and Todorov’s theory. But in a study of how participants of Chinese origin judge Chinese faces, the results differed. “Dominance didn’t seem to be an important dimension for social judgements in China but competence did,” DeBruine says. In using a more diverse Asian sample in the accelerator study, however (one that included Malaysia and Thailand, among other countries), DeBruine says the dominance component was supported.

Todorov says he is surprised how well his valence-dominance model holds up in several parts of the world, since theoretically one would expect a lot more variation among different cultures and geographic locations. “The data from this large-scale replication are an incredible resource and I am extremely grateful to the lead authors who initiated the project.”

Interestingly, DeBruine says, when the accelerator project researchers analyzed their data using a different technique, they saw much more cultural diversity. In Asia, for example, dominance turned out to be relatively unimportant.

Although Chartier wants the accelerator studies to contribute valuable knowledge, his wider ambitions are much greater. “We hope that it kind of shifts the norms or kind of the expectations of psychological science,” Chartier says, “towards these larger samples, more diverse samples, more rigorous methods, and preregistration,” among other things.

Despite being branded as an accelerator, the project has needed two years to produce its first study, partly because it takes longer to coordinate within large teams and collate data from multiple locations. “Sometimes, the name is almost a curse,” Chartier says. But the project does have several more studies in the pipeline, he adds, three of which are at the data collection stage.

Even at this modest pace, each accelerator project should produce knowledge “likely to be greater than that produced by 100 typical solo or small-team projects,” says Simine Vazire, a psychologist at the University of California, Davis, who is not involved with the accelerator. “Even though it looks slow, it is actually likely to produce discoveries and knowledge at a faster rate than the heaps of little studies we are used to pumping out.”

Still, Chartier stresses the accelerator is not a replacement for small studies, which can have their own strengths. Rather, he says, researchers should be more cautious in discussing theories built upon studies that have not been widely replicated or tested globally.

One goal for the accelerator, Chartier adds, is to operate as a model for academia more widely. For instance, he notes, the network chooses which topics to study democratically. After an initial call for submissions, they anonymize applicants’ names and other key information to weed out any potential biases. The study selection committee, a group of five researchers, then assesses whether the accelerator has the bandwidth to carry out the study.

For studies that pass this stage, Chartier tracks down around 10 experts — within and outside the accelerator — to review each submission. Following the review, all accelerator members rate each project via an online survey. The selection committee decides which projects are accepted based on all the feedback and ratings.

“The collaborative model for selecting what research questions to study and how to study them is unlike anything I’ve seen in psychology previously,” says Sanjay Srivastava, a psychologist at the University of Oregon who is not involved with the accelerator. “As a field we sometimes struggle with doing truly cumulative work because everybody wants to create their own little theoretical fiefdom.”

Once it’s decided which study the accelerator network labs are going to work on, the authors often publish a registered report outlining their approach, after quality control checks from experts but before the data collection stage. This process, known as preregistration, has become popular in psychology in recent years.

One benefit of preregistration is that it allows for expert feedback before data collection. Another, DeBruine notes, is that studies are guaranteed to be published as long as they follow the agreed-upon protocol. This would weed out the long-standing problem of publication bias, where scholarly journals publish papers reporting that a trend exists and ignore those that don’t. Vazire says the accelerator is also “pushing the boundaries of good scientific practice by innovating new methods that we hadn’t imagined before.”

But DeBruine says it was tricky to prepare the preregistration report for the valence-dominance study, which was published in May 2018 with more than 100 co-authors. Normally, journals “ask you to have all the authors on the paper when you submit it,” she adds, but “we weren’t sure actually who all the authors would be in the end.” The final study has 243 authors.

What’s more, the researchers think many more stories will emerge as others explore the accelerator’s data to test other hypotheses about how people perceive faces. To incentivize such projects, the accelerator is giving out 10 prizes of up to $200 to answer new questions the original team didn’t address.

“The data potentially could answer so many questions,” DeBruine says. For instance: How do participants from one gender judge the opposite gender? How do they judge the same gender? And how do people from one race judge those of another?

But what the accelerator team doesn’t want is for people to run analyses on several ideas at once and only report trends deemed to be “statistically significant.” That’s because the team wants researchers to avoid publication bias by reporting not only real trends, but also expected trends that didn’t turn up. Running studies as preregistrations also fixes the problem of researchers coming up with hypotheses after already delving into the data — a frowned-upon but common practice in academia known as Hypothesizing After The Results Are Known, or HARKing.

The Psychological Science Accelerator isn’t the only project seeking to address the reproducibility problem. Other recently conducted efforts with similar goals include the Reproducibility Project: Psychology, Social Sciences Replication Project, Many Labs, and Many Labs 2, among others. But the accelerator is unique in two ways, Chartier says. First, collaborators plan to continue to work on large-scale efforts indefinitely. And second, the accelerator isn’t necessarily limited to replication studies, opening it to novel and exploratory work.

The lack of reproducibility has led to methodological reforms in psychology, says Jessica Flake, a psychologist at McGill University in Montreal and a co-author of the first accelerator study. But new incentives would also help weed out sloppy research, she adds. For instance, academics are often concerned about whether they get adequate acknowledgement for their papers. To ensure that’s the case, the accelerator clarifies how each co-author contributed with the CRediT taxonomy, a list of 14 roles authors may have played in the preparation of a study.

Vazire agrees that scientists often don’t have the right incentives to produce solid research. What’s more, she says most of the people working on the accelerator projects appear to be doing so at some cost. “The model of science as lone geniuses making discoveries in their own laboratory is unrealistic for most sciences,” she says, “and having to fit that model to be rewarded leads to fewer scientific discoveries and slower progress.”

So far, the accelerator hasn’t attracted much funding and remains largely a labor of love or part of the daily job of those involved. For now, the accelerator plans to turn around roughly three studies every year, Chartier says, but could potentially aim for more with some financial support.

Vazire is impressed by what she calls the accelerator’s “no shortcuts” approach. “This is what we teach our students that science should look like,” she says, “but until recently, it almost never actually looked that way, at least in my corner of science.”


Dalmeet Singh Chawla

Dalmeet Singh Chawla is a science journalist based in London. He was shortlisted for the "Outstanding Young Journalist" category of the Asian Media Awards. In June 2016, he was shortlisted for the "Best Newcomer" Science Journalist award from the Association of British Science Writers. He usually reports on scholarly publishing, meta-research, scientific method, higher education policy, research tools, bibliometrics and psychology.
