Last week’s post about the flimsy evidence regarding stereotype threat (and the massive amounts of money wasted chasing stereotype threat effects) was far more popular than I expected. The tweet where I publicized it was my most viewed tweet in over a year, and already it is one of my most popular blog posts ever.
So, I decided to follow up on that success by discussing one apparent (but misleading) strength of the stereotype threat literature: its sheer size. Hundreds of studies have been published on stereotype threat, most of which provide evidence supporting stereotype threat. Claude Steele himself mentioned this in discussing why he believed stereotype threat was a real phenomenon:
“in the face of hundreds of conceptual replications of the effect across different groups, different stereotypes, behavioral and physiological measures, by hundreds of investigators some of whom are skeptics of the effect, in different places all over the world, I don’t think it’s rational to doubt the existence of the effect on the basis of one or even several replication failures of a particular experiment” (Claude Steele, quoted in Nussbaum, 2017, para. 6)
Could most or all of these studies really be wrong? Yes, they could–especially when they are all contaminated with small sample sizes, flexible statistical analysis, and a known publication bias.
Steele is not the first psychologist to use this argument. In the 1880s, Alfred Binet was working with Jean-Martin Charcot. The latter had published research (supposedly) showing that a handheld magnet outside of the skull could produce changes in emotion and in nervous system functioning in hypnotized people. In response, Hippolyte Bernheim argued that the magnetic effect was not real and was instead the product of the experimenter’s suggestions to an easily influenced subject.
Binet countered, arguing that it was improper for Bernheim to disregard magnetic effects so easily, “. . . a problem that has been so seriously studied by so many distinguished men” (quoted in Wolf, 1973, p. 57). Binet thought that it was highly unlikely that a finding that had been “replicated” so many times in so many settings could be false. This is the exact same logic that Steele engages in when defending stereotype threat research by claiming it has been demonstrated by so many experimenters.
But of course, it was false. Small magnets do not make hypnotized people change emotional states. Bernheim was right, and Charcot and Binet were wrong. The fact that the “effect” of magnets had been “demonstrated” so many times was irrelevant. Likewise, until stereotype threat researchers eliminate publication bias, low statistical power, and flexibility in conducting and analyzing their studies, the hundreds of studies on the phenomenon are not evidence for the reality of stereotype threat.
In honor of Binet, I propose that we name this specious reasoning the “Fifty million Frenchmen can’t be wrong” fallacy. A single, high-quality, pre-registered, large n study is infinitely more valuable than hundreds of low-quality studies with lower power.
The logic that “fifty million Frenchmen can’t be wrong” is a primitive version of “vote counting” to assess the scholarly literature. In vote counting, a person examines all the studies about a topic and tallies up the number that support a theory and the number that contradict a theory. The result with the largest number of “votes” (i.e., studies agreeing with it) is seen as the stronger theory.
While it sounds good, methodologists have known for decades that vote counting produces results that are sometimes the exact opposite of the truth (Glass, 1976, 1977; Meehl, 1990), especially when publication bias distorts the literature (Vadillo et al., 2016)–as we know it does with stereotype threat research.
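To make the problem concrete, here is a minimal simulation sketch in Python. It is not drawn from any of the cited papers, and every number in it is an assumption chosen for illustration: the true effect is set to exactly zero, each study has 20 people per group, a study "supports" the theory if a one-sided z-test (with the standard deviation treated as known) is significant at the 5% level, and non-significant studies are published only 2% of the time. The function names `simulate_study` and `vote_count` are hypothetical.

```python
import random
import statistics
from math import sqrt


def simulate_study(rng, n_per_group, true_effect):
    """Simulate one two-group study and return its z statistic.

    Simplifying assumption: both groups have a known SD of 1,
    so a z-test is used instead of a t-test.
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    standard_error = sqrt(2.0 / n_per_group)
    return diff / standard_error


def vote_count(n_studies=2000, n_per_group=20, true_effect=0.0,
               publish_prob_nonsig=0.02, seed=1):
    """Tally published 'votes' for and against an effect.

    Assumed publication rule: significant results in the predicted
    direction are always published; everything else is published
    with probability publish_prob_nonsig.
    """
    rng = random.Random(seed)
    supporting = 0
    not_supporting = 0
    for _ in range(n_studies):
        z = simulate_study(rng, n_per_group, true_effect)
        if z > 1.645:  # one-sided 5% significance threshold
            supporting += 1
        elif rng.random() < publish_prob_nonsig:
            not_supporting += 1
    return supporting, not_supporting


supporting, not_supporting = vote_count()
print("Published studies supporting the effect:", supporting)
print("Published studies not supporting it:", not_supporting)
```

Under these assumptions, only about 5% of all the studies conducted come out significant, and every one of them is a false positive. Yet because nearly all of the null results stay in the file drawer, the published "votes" favor an effect that does not exist. The exact tallies depend on the seed and on the assumed publication rule.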
If the best justification for the strength of stereotype threat literature is a logically flawed argument based on a long-discredited methodology, then the stereotype threat literature is in serious trouble. Claude Steele and other psychologists who believe that stereotype threat is real should probably gather better evidence from large n, pre-registered studies that report all data transparently. Until then, I think everyone should be skeptical about the existence of stereotype threat.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5(10), 3-8. https://doi.org/10.3102/0013189X005010003
Glass, G. V. (1977). Integrating findings: The meta-analysis of research. Review of Research in Education, 5(1), 351-379. https://doi.org/10.3102/0091732X005001351
Nussbaum, D. (2017, November 27). Claude Steele’s comment on a quote in Radiolab’s recent program on stereotype threat. Medium. https://medium.com/@davenuss79/claude-steeles-comment-on-a-quote-in-radiolab-s-recent-program-on-stereotype-threat-e67a55aaae94
Vadillo, M. A., Hardwicke, T. E., & Shanks, D. R. (2016). Selection bias, vote counting, and money-priming effects: A comment on Rohrer, Pashler, and Harris (2015) and Vohs (2015). Journal of Experimental Psychology: General, 145(5), 655-663. https://doi.org/10.1037/xge0000157
Wolf, T. H. (1973). Alfred Binet. The University of Chicago Press.