Alvaro de Menard — which we accept as the nom de blog of a self-described non-academic “independent researcher of dubious nature” — posted a blisteringly amusing critique of modern social science, “What’s Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers,” on the blog Fantastic Anachronism last week. After seeing the piece make a splash on Twitter, we asked for and received permission from de Menard to post the first two of 15 sections of the 8,000+-word post, which includes subsequent sections such as “There Are No Journals With Strict Quality Standards,” “Things Are Not Getting Better,” “Everyone is Complicit,” “Just Because a Paper Replicates Doesn’t Mean it’s Good,” and “There’s Probably a Ton of Uncaught Frauds.” Our excerpt, and the post itself, starts with a quote from a replicant …
“I’ve seen things you people wouldn’t believe.”
Over the past year, I have skimmed through 2578 social science papers, spending about 2.5 minutes on each one. This was due to my participation in Replication Markets, a part of DARPA’s SCORE program, whose goal is to evaluate the reliability of social science research. 3000 studies were split up into 10 rounds of ~300 studies each. Starting in August 2019, each round consisted of one week of surveys followed by two weeks of market trading. I finished in first place in 3 out of 10 survey rounds and 6 out of 10 market rounds. In total, about $200,000 in prize money will be awarded.
The studies were sourced from all social sciences disciplines (economics, psychology, sociology, management, etc.) and were published between 2009 and 2018 (in other words, most of the sample came from the post-replication crisis era).
The average replication probability in the market was 54%; while the replication results are not out yet (250 of the 3000 papers will be replicated), previous experiments have shown that prediction markets work well. (1, 2)
This is what the distribution of my own predictions looks like:
My average forecast was in line with the market. A quarter of the claims were above 76%. And a quarter of them were below 33%: we’re talking hundreds upon hundreds of terrible papers, and this is just a tiny sample of the annual academic production.
Criticizing bad science from an abstract, 10000-foot view is pleasant: you hear about some stuff that doesn’t replicate, some methodologies that seem a bit silly. “They should improve their methods”, “p-hacking is bad”, “we must change the incentives”, you declare Zeuslike from your throne in the clouds, and then go on with your day.
But actually diving into the sea of trash that is social science gives you a more tangible perspective, a more visceral revulsion, and perhaps even a sense of Lovecraftian awe at the sheer magnitude of it all: a vast landfill—a great agglomeration of garbage extending as far as the eye can see, effluvious waves crashing and throwing up a foul foam of p=0.049 papers. As you walk up to the diving platform, the deformed attendant hands you a pair of flippers. Noticing your reticence, he gives a subtle nod as if to say: “come on then, jump in”.
They Know What They’re Doing
Prediction markets work well because predicting replication is easy. There’s no need for a deep dive into the statistical methodology or a rigorous examination of the data, no need to scrutinize esoteric theories for subtle errors—these papers have obvious, surface-level problems.
There’s a popular belief that weak studies are the result of unconscious biases leading researchers down a “garden of forking paths”. Given enough “researcher degrees of freedom” even the most punctilious investigator can be misled.
I find this belief impossible to accept. The brain is a credulous piece of meat, but there are limits to self-delusion. Most of them have to know. It’s understandable to be led down the garden of forking paths while producing the research, but when the paper is done and you give it a final read-over you will surely notice that all you have is an n=23, p=0.049 three-way interaction effect (one of dozens you tested, and with no multiple testing adjustments of course). At that point it takes more than a subtle unconscious bias to believe you have found something real. And even if the authors really are misled by the forking paths, what are the editors and reviewers doing? Are we supposed to believe they are all gullible rubes?
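To see why “one of dozens tested, no adjustment” is so damning, a back-of-the-envelope calculation helps (the test counts below are hypothetical, chosen only to illustrate the scale of the problem):

```python
# Under the null hypothesis, each independent test has a 5% chance of
# yielding p < 0.05 by luck alone. Run enough of them and a "significant"
# result is nearly guaranteed:
#   P(at least one false positive) = 1 - (1 - 0.05) ** n
for n in (1, 12, 36):
    p_any = 1 - 0.95 ** n
    print(f"{n:>2} tests -> P(at least one false positive) = {p_any:.2f}")
```

With a few dozen unadjusted comparisons, finding *some* p=0.049 interaction is closer to the expected outcome than to a discovery.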
People within the academy don’t want to rock the boat. They still have to attend the conferences, secure the grants, publish in the journals, show up at the faculty meetings: all these things depend on their peers. When criticising bad research it’s easier for everyone to blame the forking paths rather than the person walking them. No need for uncomfortable unpleasantness. The fraudster can admit, without much of a hit to their reputation, that indeed they were misled by that dastardly garden, really through no fault of their own whatsoever, at which point their colleagues on twitter will applaud and say “ah, good on you, you handled this tough situation with such exquisite virtue, this is how progress happens! hip, hip, hurrah!” What a ridiculous charade.
Even when they do accuse someone of wrongdoing they use terms like “Questionable Research Practices” (QRP). How about Questionable Euphemism Practices?
- When they measure a dozen things and only pick their outcome variable at the end, that’s not the garden of forking paths but the greenhouse of fraud.
- When they do a correlational analysis but give “policy implications” as if they were doing a causal one, they’re not walking around the garden, they’re doing the landscaping of forking paths.
- When they take a continuous variable and arbitrarily bin it to do subgroup analysis or when they add an ad hoc quadratic term to their regression, they’re…fertilizing the garden of forking paths? (Look, there’s only so many horticultural metaphors, ok?)
The bottom line is this: if a random schmuck with zero domain expertise like me can predict what will replicate, then so can scientists who have spent half their lives studying this stuff. But they sure don’t act like it.
…or Maybe They Don’t?
The horror! The horror!
Check out this crazy chart from Yang et al. (2020):
Yes, you’re reading that right: studies that replicate are cited at the same rate as studies that do not. Publishing your own weak papers is one thing, but citing other people’s weak papers? This seemed implausible, so I decided to do my own analysis with a sample of 250 articles from the Replication Markets project. The correlation between citations per year and (market-estimated) probability of replication was -0.05!
You might hypothesize that the citations of non-replicating papers are negative, but negative citations are extremely rare. One study puts the rate at 2.4%. Astonishingly, even after retraction the vast majority of citations are positive, and those positive citations continue for decades after retraction.
As in all affairs of man, it once again comes down to Hanlon’s Razor. Either:
- Malice: they know which results are likely false but cite them anyway.
- or, Stupidity: they can’t tell which papers will replicate even though it’s quite easy.
Accepting the first option would require a level of cynicism that even I struggle to muster. But the alternative doesn’t seem much better: how can they not know? I, an idiot with no relevant credentials or knowledge, can fairly accurately distinguish good research from bad, but all the tenured experts cannot? How can they not tell which papers are retracted?
I think the most plausible explanation is that scientists don’t read the papers they cite, which I suppose involves both malice and stupidity. Gwern has a write-up on this question citing some ingenious analyses based on the proliferation of misprints: “Simkin & Roychowdhury venture a guess that as many as 80% of authors citing a paper have not actually read the original”. Once a paper is out there nobody bothers to check it, even though they know there’s a 50-50 chance it’s false!
Whatever the explanation might be, the fact is that the academic system does not allocate citations to true claims. This is bad not only for the direct effect of basing further research on false results, but also because it distorts the incentives scientists face. If nobody cited weak studies, we wouldn’t have so many of them. Rewarding impact without regard for the truth inevitably leads to disaster.
To read the rest of this post, and appreciate some of the footnoting which didn’t translate well onto the Social Science Space platform, please visit Fantastic Anachronism here. We’d be interested in hearing your thoughts about de Menard’s critique of social science, or links to responses you may have made elsewhere. Comment below, or write us at firstname.lastname@example.org.