Not Science, Not Social, Not True
In the summer of 2015, a group of well-credentialed researchers announced that they had tried to replicate the findings of the most influential experiments in the field of social psychology… and 61% of the attempted replications failed.
The Reproducibility Project, as it was known, carried impeccable bona fides. The project was led by the Center for Open Science and its co-founder, professor Brian Nosek of the University of Virginia. The project’s goal was to encourage transparency and data sharing among research scientists. It was the largest undertaking of its kind in the history of the social sciences.
The news that a majority of key social psychology experimental findings were, at best, dubious should have rumbled like an earthquake through higher education, where social science and an obsession with data generally have infiltrated every academic endeavor (“Quantitative Ethics!”).
At first, some social scientists and the journalists who cover them expressed dismay that bad science might underlie so many of their most cherished axioms and practices. But they got over it.
Paying too much attention to the Reproducibility Project’s work would have been a particular blow to science reporters. The meat-and-taters of their trade is the colorful, provocative, and always relevant finding of some new social science experiment:
A new study by researchers at [Harvard, Berkeley, Keokuk Community College] suggests that [dog lovers, redheads, soccer goalies] are much more likely to [wear short sleeves, drink craft beer, play contract bridge] than cat lovers, but only if [the barometer is falling, they are gently slapped upside the head, a picture of Roger Clemens suddenly appears in their cubicle…].
Without such findings, science reporters would find their production of “content” reduced by half or more. The entire mega-selling corpus of the New Yorker social-science writer Malcolm Gladwell would collapse, along with that of his many imitators in the pop-science racket. Marketers who need fresh data, however spurious, to bamboozle clients would suddenly be left empty-handed. Armies of grad students would find themselves with nothing to do. Lots of people have an interest in pretending the Reproducibility Project didn’t happen.
And yet, for anyone except academics and science reporters, the catastrophic replication rate is hard to ignore. Nosek fielded 270 researchers to attempt the 100 replications, and only 39 of the original findings could be confirmed.
In experimental science, replication functions as the great backstop. If an experiment can’t be repeated and yield much the same results, then the original finding is questionable. And it certainly requires further attempts at replication.
At least that’s how things work in real science… And social scientists are quite insistent that they are as “real” and as rigorous as chemists and physicists. This is why they ape the methodology of the physical sciences. A good sociologist or social psychologist will have a hypothesis, an experiment with which to test it, a place called a “lab” to do the experiment in, and human guinea pigs to serve as experimental subjects. All of this yields loads of data to study and manipulate, usually with statistics. Just as real scientists do.
Yet attempts at replication are rare in the social sciences. And the absence of replication is merely one way in which social science fails to qualify as science. It is hard to overstate how sweeping the consequences of the Reproducibility Project should have been. Many of the foundation stones of social psychology, behavioral economics, and sociology were called into question.
“Priming,” for example, is an almost ubiquitous and problematic practice in social science experiments. Researchers offer subtle or subconscious cues to subjects and then measure their reactions under varying conditions. One seminal study, for example, claimed to show that if subjects were presented (“primed”) with words commonly associated with aging, they would – unconsciously – walk more slowly when they left the psych lab.
Thousands of experiments have been built on the assumption of priming’s effectiveness. Yet the Reproducibility Project researchers failed to replicate the studies that first persuaded social scientists that priming had lasting effects. Since the project’s report, other attempts to replicate the original priming studies have also failed.
The Reproducibility Project generated other surprises, but the real surprise is that anyone should have been surprised. The warning bells have been clanging around social science for many years.
The Cult of Statistical Significance
The “reproducibility crisis” isn’t peculiar to the social sciences. More than a decade ago, a professor of medicine at Stanford named John Ioannidis published a paper with the arresting title, “Why Most Published Research Findings Are False.” He was talking about research in medicine, and his main complaint was about the over-optimistic use of statistics. Since then many attempts to replicate medical research have only underscored his warning.
The main weakness Ioannidis pointed to was the use of “statistical significance” to validate a finding. Statistical significance is a bedrock of social science as well.
Defining statistical significance rigorously would require a dip into one of the muddier pools of mathematics. But what it is, in brief, is a test meant to show that your data reflect a real pattern and not just a bunch of junk numbers – conventionally, that the odds of getting such a result by chance alone fall below 5 percent.
When employed correctly, statistical significance allows a researcher to judge how likely it is that his finding did or didn’t occur by mere chance. It has its uses. In public opinion polling, 70 years of practice have shown that statistical significance is indispensable in determining the likelihood that a polling result is accurate.
For comparison, public opinion polling is an experiment performed on a relatively large number of people (usually between 600 and 1,200) who have been randomly selected from a general population – registered voters, for example. Random selection allows a pollster to generalize from the smaller sample to that larger population. But the subjects in social science experiments are almost never randomly selected. They are often laughably unrepresentative of the general population, and their number is usually very small, for reasons of time and money.
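For readers who like to see the arithmetic, here is a minimal sketch in Python – my own illustration, not anything drawn from the project’s report – of why a randomly drawn sample of 600 to 1,200 respondents is good enough for a pollster: the margin of error shrinks with the square root of the sample size, and the formula is only honest when the sample is random.

import math

def margin_of_error(n, p=0.5, z=1.96):
    # Approximate 95 percent margin of error for a simple random sample of size n.
    # p = 0.5 is the worst case; z = 1.96 is the conventional 95 percent multiplier.
    # The formula holds only when respondents are drawn at random from the population.
    return z * math.sqrt(p * (1 - p) / n)

for n in (600, 1200):
    print(f"sample of {n:>4}: +/- {margin_of_error(n):.1%}")
# sample of  600: +/- 4.0%
# sample of 1200: +/- 2.8%

Plug in a panel of 20 undergraduates instead and the same formula promises a margin of error of roughly plus-or-minus 22 percent – and even that figure is meaningless, because the panel wasn’t randomly selected in the first place.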
In her brilliant monograph, The Cult of Statistical Significance, co-written with Stephen Ziliak, the economist Deirdre McCloskey showed why the difference between the two selection methods is important. If a group of subjects isn’t randomly selected, then you can’t accurately generalize and “scale up” your findings. So social scientists have found a workaround.
As a measure of whether an experimental finding is “true,” they have substituted the standard of “statistical significance” for the standard of common sense. That is, instead of gathering a large, random sample, researchers take the data generated by their small, nonrandom samples and subject them to various kinds of statistical manipulation. Then, when the data show, or seem to show, some kind of “significant” pattern, the researchers claim their finding is valid.
Like Ioannidis, McCloskey showed that such methods always run the risk of mistaking random noise (“statistical noise”) for meaningful data. Indeed, she and Ziliak said, an obsession with making the numbers appear statistically significant can obscure a vast array of methodological flaws.
McCloskey’s book all by itself should have caused an about-face in social-science research. Instead, it was published, blandly praised, and never rebutted. For convenience’s sake, it was tossed down the memory hole, where it was later joined by the Reproducibility Project.
The statistical weakness has even become the subject of satire. In 2011, researchers from the University of Pennsylvania and UC Berkeley assembled a group of 20 undergraduates and played them Beatles records, gauging their reactions with a series of questions before and after. Pushing the data to a point of “statistical significance,” the researchers were able to “prove,” ridiculously, that listening to “When I’m Sixty-Four” actually reduced the students’ calendar age by an average of 18 months.
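To see how easily a tiny sample can be pushed over the significance line, here is a minimal simulation in Python – my own sketch, not the Penn–Berkeley team’s actual design – in which twenty subjects generate nothing but random noise, yet a “significant” result turns up in most experiments once the researcher is free to test twenty different outcome measures.

import random
import statistics

def t_stat(a, b):
    # Two-sample t statistic with pooled variance, using only the standard library.
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2)
    se = (pooled * (1 / len(a) + 1 / len(b))) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

random.seed(64)
experiments, hits = 2000, 0
for _ in range(experiments):
    # Two groups of 10 subjects and 20 outcome measures, all pure noise.
    # |t| > 2.1 is roughly the conventional p < .05 cutoff at 18 degrees of freedom.
    # The 20 outcomes are treated as independent tests for simplicity.
    significant = any(
        abs(t_stat([random.gauss(0, 1) for _ in range(10)],
                   [random.gauss(0, 1) for _ in range(10)])) > 2.1
        for _ in range(20)
    )
    hits += significant
print(f"At least one 'significant' finding in {hits / experiments:.0%} of experiments")
# Expect a figure in the neighborhood of 60 to 65 percent: noise dressed up as discovery.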
The process of social science – by which bad methods lead to bad experiments, which lead to bad findings, which lead to bad papers published in overrated journals, which lead to bogus stories on NPR and in the Washington Post – has been called “Natural Selection for Bad Science.” It’s as if an invisible hand were guiding researchers into faulty practices at each stage. The headwaters of this process are known as “publication bias.”
For the young social scientist in a tenure-track job at a university, “publish or perish” is a pitiless mandate. Editors of academic journals want to publish papers that bring favorable attention from journalists who crave something novel and flashy to report. A paper describing a failed experiment – even if this negative result is scientifically significant – is unlikely to find a home in a professional journal. The bias for positive results encourages the researcher to tweak the data until they yield something that looks like new information. Social scientists look at a mound of data like the boy in the old joke… “There must be a pony in there somewhere.”
Surveys have shown that published studies in social psychology are five times more likely to show positive results – that is, to confirm the experimenter’s hypothesis – than studies in the physical sciences. This means one of two things. Either social science researchers are the smartest and luckiest researchers in the history of experiments, or something has gone very wrong.
Studies in Kiddieland
The central conceit of social science is that its experiments will yield generally applicable truths about the entire human race. A couple years ago a Canadian economist named Joseph Henrich did the math and found that 70% of published social-science studies are generated in the United States, and that the subjects of more than two-thirds of those studies are exclusively U.S. university undergraduates.
College students, Henrich noted, form “one of the worst subpopulations one could study for generalizing about Homo sapiens.” They are whiter and richer than the general U.S. population, and much whiter and richer than the populations of most other countries.
Social scientists boast that social science is the “study of real people in real-life situations.” It’s really the study of college students sitting in psych labs.
Social science experiments range from the dubious to the preposterous. Usually the kids are offered a course credit or a bit of cash for their participation. They fill out a questionnaire, respond to images on a screen, or roleplay “real life” situations made up by the scientists.
Consider one set of well-known experiments. Together they form, in the words of a New York Times columnist, “an extensive academic critique of the right.” This allegedly scientific enterprise manages to prove that conservatives and Republicans lack compassion and tolerance and are quicker to act unethically than their counterparts on the left.
How do we know? Well, here’s how…
Several years ago, graduate students at UC Berkeley managed to corral 118 undergraduates for an experiment. They were given course credit or $15 for their cooperation. The subjects bore no resemblance to any wider population. Most were under the age of 21. By definition, all 118 were the kind of kid who goes to Berkeley. Only 3.5% were African-American, and nearly half were Asian-American. (According to Gallup, Asian-Americans are the only ethnic group in which a majority describe themselves as politically liberal.)
The researchers wanted to know how “powerful people” see the world. They asked the kids to fill out a questionnaire answering such queries as “Agree or disagree: I think I have a great deal of power.” Then the subjects were divided into pairs and told to sit facing each other, two feet apart. They were wired to an electrocardiogram and filmed by video cameras.
The subjects were told to tell their partners about some traumatic incident in their lives. The electrocardiogram registered reactions.
Once the data had been teased sufficiently, the scientists reported that “social power attenuates emotional reactions to those who suffer.” As a simulation of human behavior, the experiment was absurd. But its conclusions were entered into the canon of truth according to social science.
In subsequent experiments and later press accounts, “powerful people” were explicitly identified as “conservative” or “Republican” – a nice, but not necessarily very accurate, compliment to those of us who are conservative or Republican. Equally artificial follow-up studies “confirmed” Republican character defects like discomfort with ambiguity, a preference for stereotypes, and a hair-trigger fear of threatening situations, among others.
And you’ll be relieved to learn that science has discovered liberals are much better. They’re “open to experience,” “tolerant of difference,” and “comfortable with ambiguity.”
In the “academic critique of the right,” you find a catalogue of scientific derelictions. Small sample sizes and limited sample types are only the beginning. There’s shoddy data collection, undefined terminology, statistical malfeasance, a lack of control groups, and a willingness to change hypotheses mid-experiment to conform to the data. In the physical sciences, any one of these would be enough to disqualify the work.
Nobody Here but Us ‘Scientists’
So, how do they get away with it? How does an endeavor so transparently implausible get accepted as science? Social-science research, aside from glimmers of hope like the Reproducibility Project, has been a closed circle. Bad practice reinforces itself, with little room for the self-correction that is essential to scientific progress.
The situation is made worse by the field’s ideological monochrome. Among academic disciplines, social science is the least politically diverse. In a survey of the membership of the Society for Personality and Social Psychology, 85% of respondents called themselves liberal, 6% identified as conservative, and 9% identified as moderate. And only 2% of graduate students and postdocs called themselves conservative.
“The field is shifting leftward,” wrote one team of social psychologists (identifying themselves as “one liberal, one centrist, two libertarians, two who reject characterization,” and no conservatives). “And there are hardly any conservative students in the pipeline.”
Most serious social scientists will acknowledge that the field leans left. They’ll also insist that politics doesn’t contaminate their science. This ignores “confirmation bias,” one of the few well-established findings in social psychology. In plain English, the phrase means we tend to believe what we want to believe. Bias is hard to see when everyone you work with is biased in the same direction. Recall the fish who was asked how the water felt… “What the hell’s water?” he replied.
Consider again that “extensive academic critique of the right.” One hugely influential paper summarizes its findings like so:
A meta-analysis confirms that several psychological variables predict political conservatism: death anxiety; system instability; dogmatism… fear of threat and loss; and self-esteem. The core ideology of conservatism stresses resistance to change and justification of inequality and is motivated by needs that vary situationally and dispositionally to manage uncertainty and threat.
This is almost self-parody. Most American conservatives I know favor economic deregulation, want to abolish multiple federal agencies, and welcome the creative destruction of the free market, which is a dumb way to resist change. Notwithstanding its wild inaccuracy, this paper has been cited as sober science in more than 2000 other academic papers since its publication.
Micro-hooey
The leftward tilt of social science ensures that it has become a handmaid for the most fashionable ideological fads. The current rage for “microaggressions” is rooted in social science performed more than a decade ago. The existence, power, frequency, and stubbornness of microaggressions are now taken as settled facts.
Multiple areas of American life – from policing to education – have been reshaped accordingly. The concept is the basis of countless “diversity” seminars and training programs in fire departments, corporate workplaces, government agencies, and universities (of course). At the University of Wisconsin, use of the phrase “politically correct” is now officially considered a microaggression. If you describe America as the “land of opportunity” anywhere in the University of California system, you’re judged to be microaggressing and told to knock it off. Or else.
The microaggression panic happened without anyone stopping to double check the science behind the concept. You won’t be surprised to learn it’s not very good – the rotten fruit of ideological wishful thinking. The phrase was first popularized a dozen years ago in a paper by a professor at Columbia University’s education school. He and his team discovered more than twenty microaggressions against people of color. Among them: hanging pictures of white U.S. presidents on your wall (it sends the signal that only white men can succeed).
The paper was an instant smash, thanks to confirmation bias. It fit the leftish narrative shared by social scientists – that bourgeois Americans were in thrall to destructive and harmful stereotypes that could only be rectified by aggressive reprogramming. Soon researchers were discovering microaggressions against the poor, the disabled, women, and members of the LGBTQ+ community.
The research, such as it was, entered the cultural and political bloodstream unchecked. At last, in 2017, a well-known psychologist named Scott Lilienfeld surveyed the scientific literature of microaggressions for Perspectives on Psychological Science. Astonishingly, Lilienfeld found, no researcher had even taken the trouble to try to replicate the original Columbia study, though thousands of subsequent papers had relied on it as sound science. (Over the past decade, the Columbia study has been cited in social-science papers an average of three times a week.)
Entire experiments, Lilienfeld wrote, often consisted of nothing more than focus groups of ten or twelve subjects. Researchers encouraged them to describe the everyday encounters and inadvertent comments they found racially insensitive. And the researchers wouldn’t take no for an answer. Members of the focus groups were seldom identified by any meaningful criterion, and their reactions were tossed together and published as further proof of the microaggression epidemic.
So the literature piled up until the stack of studies was described, by journalists and researchers alike, as “overwhelming evidence” for the reality of microaggressions. “Scientists have discovered…” “Studies show…” “Research reveals…” Et voilà. Social Science!
Failing Failsafe
All experimental sciences rely on the failsafe of peer review before publication. It’s expected that shoddy research papers will be caught by a panel of two or three academics asked to double-check the soundness of the work before it’s published. Yet in recent years, the weaknesses of the system have become undeniable.
Most of us probably think “peer review” means that a third party has replicated the research and confirmed its finding. But, as we’ve seen, replication is almost never attempted in social science, certainly not at the level of peer review. Reviewers are busy careerists who give the paper a cursory review for obvious errors, at best. Being anonymous, they will pay no price if they get it wrong.
Three years ago, the former editor of the British Medical Journal told the Royal Academy in London about an experiment of his own. A paper containing eight deliberate errors was sent to 300 researchers for peer review. No reviewer found more than five of the errors, most of them found two, and one in five of them found none.
“If peer review was a drug it would never get on the market,” he said, “because we have lots of evidence of its adverse effects and don’t have evidence of its benefit.”
As a result, said an editorial in the Lancet, “much of the scientific literature, perhaps half, may simply be untrue.” If this is true of the physical sciences, it can only be worse in the fields that aim to be like the physical sciences but always fall short.
Hayek and Humility
Friedrich Hayek pointed to the fundamental problem in his 1974 Nobel Prize lecture. The social sciences are of strictly limited use by their very nature. Human actions are infinitely complicated in motive, execution, and circumstance. In their fullness, they cannot be reduced to data.
“A theory of essentially complex phenomena” – the aim of all social science – “must refer to a large number of particular facts,” Hayek wrote. “To derive a prediction from it, or to test it, we have to ascertain all these particular facts.”
Which, Hayek said, is impossible. It’s touching to think of the childlike faith of researchers who think they can reproduce and quantify real-world human behavior in their campus psych labs, and thereby discern enduring truths about our nature. Childlike – or slightly sinister?
Hayek went on: “To act on the belief that we possess the knowledge and the power which enable us to shape the processes of society entirely to our liking, knowledge which in fact we do not possess, is likely to make us do much harm.”
Yet we persist in doing social science, in blithely reporting and accepting its findings, adjusting government policy and our own behavior according to it, against all evidence, even as its conceits unravel in plain sight. It’s not hard to see why some of us persist.
The great economist Kenneth Arrow worked as a statistician in World War II. One of his jobs was to analyze weather forecasts and send them on to his commanding general.
It wasn’t long before Arrow and his colleagues discovered that the forecasts were essentially worthless; no forecast had more than a 50% chance of being correct. Shocked, he sent this alarming information to his superiors.
After several days, he got a response.
“The commanding general knows the forecasts are no good,” Arrow was told. “But he needs them for planning purposes.”
Andrew Ferguson is the author of several books, including Crazy U: One Dad’s Crash Course on Getting His Kid Into College. He is a former speechwriter for President George H. W. Bush and a current senior editor at The Weekly Standard.