
Is It Time to Say Goodbye to a Fickle Friend, the P Value?

March 11, 2015

(Photo: W. Carter / CC BY-SA 4.0 via Wikimedia Commons)

How should scientists interpret their data? Emerging from their labs after days, weeks, months, even years spent measuring and recording, how do researchers draw conclusions about the results of their experiments? Statistical methods are widely used, but our recent research in Nature Methods reveals that one of science's classic statistics, the P value, may not be as reliable as we like to think.

Scientists like numbers, because they can be compared with other numbers. And often these comparisons are made with statistical analyses, to formalize the process. The broad idea behind all statistical analyses is that they allow the researcher to make somewhat objective assessments of the results of their experiments.

Which drug is more effective?

Scientists often conduct experiments to investigate whether there is a difference between two conditions: do people get better more quickly after taking the blue pill (condition one) or the red pill (condition two)? The most common method for assessing whether the pills differ in their effectiveness is to run a trial in which some patients are given the blue pill and some the red, and then to analyze the results statistically to determine whether there is strong evidence that one color is more effective than the other.

This article by Lewis Halsey originally appeared at The Conversation, a Social Science Space partner site, under the title “Goodbye P value: is it time to let go of one of science’s most fundamental measures?”

To assess experimental results, scientists very often use a “P value” (P is for probability). This is used to show how convincing these results are: if the P value is small, they think that the findings are real and not just a fluke. In our pill example, if P is small this is considered good evidence that there is a difference in effectiveness of the two colors of pill.

Although P is never proof that there is a difference – scientific studies never prove things, they only provide a degree of evidence for them – studies with low P values are thought to be convincing, and so are not often repeated to be sure the results are correct. This might seem reasonable because there is limited money and time in science – results from a study that seem very clear perhaps do not warrant double-checking when there are new discoveries out there to be made.

P values are fickle friends

However, we have used simple models to show that the P value often varies dramatically if a study is replicated. Our models depict a simple scenario. Samples have been measured from two conditions. A statistical test called a t-test is conducted to investigate whether there is good evidence that the conditions are different, and the test result is interpreted by the generation of a P value.

The two conditions in our scenario are indeed somewhat different and so we might expect a reasonable sample size to uncover this difference. That is, a reasonable sample size will return a low P value associated with the t-test. However, when we repeat the model experiment many times over, we find that the P value varies dramatically each time.
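The scenario above can be sketched in a few lines of Python. This is only an illustration, not the models used in the Nature Methods study: the sample size (30 per condition), the true difference (half a standard deviation), the number of repeats and the random seed are all assumptions chosen for the example. To avoid third-party packages, the P value is computed by numerically integrating the Student t density; a statistics library would normally do this step.

```python
import math
import random
import statistics


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_stat, df, steps=200):
    """Two-sided P value, using Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = 2.0 * total * h / 3.0  # probability that |T| < |t_stat|
    return max(0.0, 1.0 - central_area)


def t_test(a, b):
    """Student's two-sample t-test with pooled variance; returns the P value."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t_stat = (statistics.mean(a) - statistics.mean(b)) / se
    return two_sided_p(t_stat, df)


def replicate(n_runs=1000, n=30, true_diff=0.5, seed=1):
    """Repeat the same two-condition experiment n_runs times (assumed settings)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_runs):
        # The two conditions really do differ, by true_diff standard deviations.
        blue = [rng.gauss(0.0, 1.0) for _ in range(n)]
        red = [rng.gauss(true_diff, 1.0) for _ in range(n)]
        p_values.append(t_test(blue, red))
    return p_values


p_values = replicate()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"smallest P: {min(p_values):.5f}")
print(f"largest P:  {max(p_values):.3f}")
print(f"share of runs with P < 0.05: {share_significant:.2f}")
```

Every run samples from the same two populations, so any spread in the printed P values comes from sampling variability alone.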

If your friend has invited you round for dinner next week but in the preceding days keeps contacting you and giving dramatically differing arrival times, you will soon conclude you have very little idea of what time dinner will actually be. Similarly, if P varies considerably each time an experiment is conducted, this makes the P value unreliable, and a poor measure of how strong the evidence is from a single run of that experiment.

The implication is huge for data analysis – a low P value returned from a study is likely to have as much to do with luck as it has to do with the presence of an important pattern in the data, and in turn a re-run of the experiment might well result in a very different P value. Therefore, a low P value for a single experiment cannot be taken as good evidence that there is a difference between the conditions.

This weakness could well explain why famous scientific findings from the past, central to the foundations of many disciplines, are not being confirmed now that the original studies are finally being re-examined.

These include a lack of reproducibility in cancer research, as well as the apparent loss of the phenomenon called “verbal over-shadowing” whereby people shown a face and asked to describe it are less likely to recognize the face later on than if they had simply looked at it.

So why is the P value so variable, so fickle? Unfortunately, it seems that the ordinary variability between the samples drawn each time an experiment is run is enough to make the P value unstable.

Moving on

So if not the P value, what should we use to analyze and interpret our data? We argue for a fundamental shift in thinking away from asking the question “is there a difference?” and towards asking “how big is the difference?”. After all, scientists rarely want to know simply whether there is a difference between conditions.

There is always a difference, even if extremely small. It is more pertinent to ask whether the difference is big enough to be of interest, to be of importance. If the effectiveness of the red pill is just 0.01 percent greater than that of the blue pill, there is a difference between them but it isn’t noteworthy – in practice one pill color is as good as the other.

The P value can be ditched and scientists can focus instead on how big the difference is between the conditions according to their experiment. They can also provide simple-to-calculate values on how precise that difference is likely to be when generalized beyond the laboratory.

Thus once data collection has finished, scientists should focus on estimating how big the difference is in the effectiveness of the blue and red pills, and how precise this estimate is likely to be. Researchers already know about these simple concepts – effect sizes and confidence intervals – they just need to start emphasizing them, and let the P value become a thing of the past.
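As a rough illustration of what reporting an effect size and a confidence interval might look like for the pill example, here is a minimal sketch. The recovery times are simulated and entirely hypothetical, the function name `effect_with_ci` is made up for this example, and the interval uses a normal approximation, which is an assumption that is reasonable for moderately large samples but slightly too narrow for small ones.

```python
import math
import random
import statistics


def effect_with_ci(a, b, level=0.95):
    """Return (difference in means, lower, upper, Cohen's d) for b minus a."""
    diff = statistics.mean(b) - statistics.mean(a)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    # Normal approximation to the critical value; for small samples a
    # t-distribution quantile would give a slightly wider interval.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    pooled_sd = math.sqrt(((len(a) - 1) * var_a + (len(b) - 1) * var_b)
                          / (len(a) + len(b) - 2))
    return diff, diff - z * se, diff + z * se, diff / pooled_sd


# Hypothetical data: days to recovery for 50 patients on each pill.
rng = random.Random(7)
blue = [rng.gauss(10.0, 2.0) for _ in range(50)]
red = [rng.gauss(9.0, 2.0) for _ in range(50)]

diff, lower, upper, cohens_d = effect_with_ci(blue, red)
print(f"difference in mean recovery time (red - blue): {diff:.2f} days")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f} days")
print(f"standardized effect size (Cohen's d): {cohens_d:.2f}")
```

The output answers "how big is the difference, and how precisely do we know it?" directly, in the units the experiment was measured in.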

Unfortunately, while a smattering of journals have now started to outlaw the P value in recognition of some of its failings, recently at least one journal has also banned the use of the confidence interval, apparently because its precise statistical definition risks it being over-interpreted and misunderstood.

A reasonable counter to this point of view is that confidence intervals are a valuable tool for estimating the margin of error around our findings – they are a crucial measure when translating our sample of data collected in the laboratory into an understanding of real world scenarios, where results really matter.

Lewis Halsey

Lewis Halsey is a senior lecturer in comparative and environmental physiology at the University of Roehampton. His publications mainly cover four topics, all of which primarily concern vertebrate environmental physiology and energetics: the respiratory physiology, energetics and behavior of ducks and cormorants; the relationships between the behavior, ecology and energetics of wild diving king penguins; comparative analysis of diving and pedestrian locomotion across species; and the development of the 'accelerometry technique' as a method for quantifying behavior and energy expenditure in terrestrial and aquatic animals.

