Last month, a new article on stereotype threat was published online in the Journal of Advanced Academics. Entitled “Stereotype Threat and Its Problems: Theory Misspecification in Research, Consequences, and Remedies” (Picho-Kiroga et al., in press), the article caught my attention because I have blogged about stereotype threat in the past, and it is the subject of Chapter 30 of my book In the Know: Debunking 35 Myths About Human Intelligence. I thought the authors might have new insights into the topic’s research flaws.

I was sadly mistaken. Instead of an insightful view into the problems of stereotype threat research, I found an article that is itself plagued with flaws. If this is what publishable research on stereotype threat looks like, then the topic is in far more trouble than I expected.

Summary of the Article

Picho-Kiroga et al.’s (in press) article is a meta-analysis of stereotype threat. The authors’ interest is in the heterogeneity found in research on stereotype threat in females, and they theorize that the varying results found across studies may be due to methodological differences.

It is a good topic to investigate. Methodological heterogeneity often leads to result heterogeneity, and understanding what methodological artifacts are associated with different results is important for understanding the actual nature of a phenomenon.

A Tale of Two Article Versions

The original version of the article was published January 24, 2021, and I read it the next day. I raised my concerns to the editors in an email in the wee hours of January 26, 2021. To the editors’ and the authors’ credit, some of the problems I identified have been fixed. These include:

  • The flow diagram of excluded studies in Figure 1 was incorrect; one box said that 138 full-text articles were included, but the studies within the box summed to 139.
  • Table 2 had at least three errors that I found. The studies listed as occurring in Western cultures did not sum to the total number of studies reported in the table (96), and the effect sizes reported in the table did not match two passages in the text that discussed the table.
  • One of the confidence intervals in the table included zero, but the test was marked as being statistically significant. This is a mathematical impossibility.
  • A supplemental file describing how studies were evaluated for methodological quality was added.
  • Heterogeneity statistics were added.
  • A greater level of certainty was applied when interpreting wide confidence intervals that supported stereotype threat than when interpreting equally wide confidence intervals that did not support the theory.
  • An error in the formula for Cohen’s d (a formula so basic that it appears in my introductory statistics textbook, Statistics for the Social Sciences: A General Linear Approach) was corrected. (The standard formula is shown below for reference.)
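
For reference, here is the standard formula for Cohen’s d with two independent groups (the pooled-standard-deviation version found in introductory textbooks), which gives a sense of how basic the corrected formula is:

```latex
d = \frac{\bar{X}_1 - \bar{X}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}
```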

I applaud the editors for forwarding my concerns to the authors and the authors for correcting these errors in the article. The corrections were made promptly, and a new version of the article was uploaded sometime between February 9 and February 15, 2021.

But the current version of the article is nearly the same as the original, and the authors seem to have made the minimum number of changes they could. Unfortunately, major problems remain in the article.

Statistical Flaws in the Article

Use of post hoc statistical power

The first major remaining problem with Picho-Kiroga et al.’s (in press) article is their use of post hoc statistical power in their analyses. Post hoc power is a flawed tool because it is just a transformation of p-values: studies that reject the null hypothesis will automatically have high post hoc power, while studies that retain it will always have low post hoc power, even if the former reject the null hypothesis solely because of sampling error or questionable research practices. Statisticians have known for years that post hoc power is a biased estimator of true power, with post hoc power usually being inflated (Yuan & Maxwell, 2005).
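
To make the point concrete, here is a minimal sketch (my own illustration, not the authors’ code or data) of how, for a fixed sample size, the observed effect size determines both the p-value and the “observed power,” so the two are simply re-expressions of one another. The group size and observed effect sizes below are hypothetical.

```python
# Post hoc ("observed") power is a deterministic function of the p-value for a
# given design: both are computed from the same observed effect size.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

n_per_group = 22            # hypothetical group size, roughly the meta-analysis median
power_calc = TTestIndPower()

for d_observed in [0.20, 0.43, 0.60, 0.80]:    # hypothetical observed effect sizes
    # Two-sided p-value for a two-sample t-test yielding this observed d
    t_stat = d_observed * np.sqrt(n_per_group / 2)
    p_value = 2 * stats.t.sf(abs(t_stat), df=2 * n_per_group - 2)
    # "Observed power": power computed as if the observed d were the true effect
    obs_power = power_calc.power(effect_size=d_observed, nobs1=n_per_group,
                                 alpha=0.05, ratio=1.0, alternative='two-sided')
    print(f"d = {d_observed:.2f}   p = {p_value:.3f}   observed power = {obs_power:.2f}")

# The mapping is one-to-one: larger p-values always translate into lower "observed
# power" (a result at exactly p = .05 corresponds to observed power of roughly .50),
# so observed power adds no information beyond the p-value itself.
```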

What makes the use of post hoc statistical power so problematic in Picho-Kiroga et al.’s (in press) meta-analysis is that they use it as a covariate in their meta-analysis and as part of a methodological quality score they gave to every study. (This score was also a covariate in an analysis.) By using post hoc power for these purposes, studies that reject the null hypothesis receive inflated methodological quality scores, and the power covariate is confounded with studies’ p-values. Both outcomes are extremely undesirable when trying to understand the heterogeneity of a body of research.

The evidence is clear that the power statistics in Picho-Kiroga et al.’s (in press) meta-analysis are inflated. They reported (p. 18) a mean statistical power of .59 (SD = .26). Yet the median sample size in their meta-analysis was n = 43.5, and the mean effect size was d = .28. Given these values, the average statistical power should be about .16. Had the authors used the effect sizes from previous meta-analyses they cited to estimate power (which is the correct procedure), the mean statistical power would have been between .10 and .26. (For you non-scientists, statistical power ranges between 0 and 1, and .80 is the general rule of thumb for a desirable level of power.)
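
As a sanity check, here is a minimal power calculation (my own, not the authors’); the median total N and mean d come from the article, while the even split into two groups is my simplifying assumption.

```python
# A priori power for a two-sample t-test at the meta-analysis' typical study size.
from statsmodels.stats.power import TTestIndPower

median_total_n = 43.5              # median sample size reported in the meta-analysis
mean_d = 0.28                      # mean effect size reported in the meta-analysis
n_per_group = median_total_n / 2   # assumption: two equal groups per study

power = TTestIndPower().power(effect_size=mean_d, nobs1=n_per_group,
                              alpha=0.05, ratio=1.0, alternative='two-sided')
print(round(power, 2))             # roughly .15, nowhere near the reported mean of .59
```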

In response to my concerns about the use of post hoc power, the authors just changed the term “post hoc power” to “observed power” (p. 13) and did not recalculate any power statistics. All analyses that used statistical power as a covariate or the methodological quality score were completely unchanged. The problem with the original version of the article was not semantics, but rather statistics.

Ironically, concerns about low statistical power are scattered throughout the text of the Picho-Kiroga et al. (in press) article (see pp. 13, 18, 19, 20, 21). But the authors do not seem to understand the different types of power nor the severity of the problem of low statistical power in the studies they analyzed.

Blindness to obvious publication bias

Another severe problem with the article is the clear evidence of publication bias in the meta-analyzed studies, which the authors did not identify. The presence of publication bias is obvious in the funnel plot on p. 19:

Funnel plot from Picho-Kiroga et al. (in press, p. 19). A funnel plot is used to investigate the presence of publication bias.

Each dot in a funnel plot represents an effect size from a study. In theory, the dots should be distributed symmetrically around the average observed effect size. (If the average observed effect were zero, then the dots would center around the white region; in this diagram, the dots should center around d = .28, because that was the average observed effect.) However, when publication bias is present, the distribution of the dots is non-symmetrical, usually with an excess of dots on the right side of the image.
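
For readers who have never built one, here is a minimal sketch (entirely my own illustration, with made-up sample sizes) of how a funnel plot for an unbiased literature is constructed; the true effect of d = .28 is assumed only to mirror the average in this meta-analysis.

```python
# Simulate an unbiased literature and draw its funnel plot: observed effect size on
# the x-axis, standard error on the (inverted) y-axis so precise studies sit at the top.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
true_d = 0.28                                   # assumed true effect (illustration only)
n_per_group = rng.integers(15, 120, size=100)   # hypothetical group sizes
se = np.sqrt(2 / n_per_group)                   # approx. SE of d for two equal groups
observed_d = rng.normal(true_d, se)             # sampling error around the true effect

plt.scatter(observed_d, se)
plt.axvline(true_d, linestyle="--")
plt.gca().invert_yaxis()
plt.xlabel("Observed effect size (d)")
plt.ylabel("Standard error")
plt.title("Unbiased literature: effects scatter symmetrically around the true effect")
plt.show()
```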

As is apparent in the image, the distribution of effect sizes in Picho-Kiroga et al.’s (in press) study is asymmetrical. This is exactly what publication bias looks like in a funnel plot. In fact, the surplus of dots on the right is so obvious that one commenter on Twitter stated:

How did the authors interpret their funnel plot? Here is what they said:

From the funnel plot, it is evident that there are just as many null stereotype threat effects as there are significant effects. That there are studies to the left of the funnel also indicates reverse stereotype threat effects. That is, for some studies, women performed better when exposed to stereotype threat than their control counterparts. Begg’s test was not statistically significant, indicating the absence of small study effects and therefore no publication bias (z = −0.71, p = .481).

Picho-Kiroga et al. (in press, p. 18)

My reaction on Twitter sums up this interpretation well:

The fact that “there are just as many null stereotype threat effects as there are significant effects” is irrelevant to the question of whether publication bias occurred. Tallying up the number of studies that reject or retain the null hypothesis to reach a conclusion about anything was already outdated methodology decades ago (Glass, 1976, 1977; Meehl, 1990). If this procedure could reveal anything, meta-analysis wouldn’t be needed. We could just count up study results instead.

Furthermore, the presence of some negative effect sizes also does not indicate a lack of publication bias — contrary to Picho-Kiroga et al.’s (in press) belief. Across studies, effect sizes form a distribution, due to regular ol’ sampling error, and that distribution will vary around a central parameter (which will be the average effect size if no publication bias or effect size heterogeneity is present). Because most social science effect sizes tend to be small (around d < .40 in most topics in psychology), some of the distribution of effect sizes will naturally “spill over” into the region of negative effect sizes. All other things being equal, the smaller the effect size, the more likely this is to happen, even if publication bias is present.

The only scenarios where a total absence of negative effect sizes would be expected would be: (1) the true effect size is so large that the lower tail of the distribution of sample effect sizes does not include zero, given the number of effect sizes in the distribution; or (2) publication bias is so strong and pervasive that no study with a negative effect size is published. Scenario 1 definitely did not happen in Picho-Kiroga et al.’s (in press) meta-analysis because the effect size is so small (mean d = .28, SD ≈ .60) that negative effect sizes are inevitable.
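
To put a rough number on that, here is a quick back-of-the-envelope check (my own arithmetic, treating the study effect sizes as approximately normally distributed with the mean and SD just cited):

```python
# Expected share of negative effect sizes if effects are roughly normal with
# mean d = .28 and SD = .60 (the values reported in the meta-analysis).
from scipy.stats import norm

share_negative = norm.cdf(0, loc=0.28, scale=0.60)
print(round(share_negative, 2))   # about 0.32: roughly a third of unfiltered effects would be negative
```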

Scenario 2 also did not happen, which rules out publication bias that perfectly filters out every study with a negative effect size. But publication bias does not need to be this perfect to distort and/or inflate the results of a meta-analysis. In a quick and dirty simulation I ran in Excel, a distribution of 102 effect sizes with a true mean of zero and a standard deviation of .60 had a meta-analytic average effect size of d = .186 and SD = .590 under a moderately intense publication bias that favored positive effect sizes over negative effect sizes by 2:1.

Additionally, in that simulation, there were almost as many effect sizes that retained the null hypothesis (39) as rejected it in favor of stereotype threat theory (44), showing that a similar number of studies rejecting and retaining the null hypothesis is not evidence that publication bias is absent.
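
For readers who want to experiment, here is a rough Python version of the same idea (a sketch under the same assumptions, not the original Excel file; it tracks only the inflated mean and SD, and the exact numbers will vary with the random draw):

```python
# Draw 102 "published" effect sizes from a literature whose true mean effect is zero,
# but where negative results are published only half as often as positive ones (2:1 bias).
import numpy as np

rng = np.random.default_rng(42)
published = []
while len(published) < 102:
    d = rng.normal(0.0, 0.60)             # true mean of zero, SD = .60
    keep_prob = 1.0 if d > 0 else 0.5     # moderate bias favoring positive results
    if rng.random() < keep_prob:
        published.append(d)

published = np.array(published)
print(round(published.mean(), 3), round(published.std(ddof=1), 3))
# The published mean typically lands around d = .15 to .20 even though the true effect is zero.
```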

What about Begg’s test that Picho-Kiroga et al. (in press) conducted? It is true that the test retained the null hypothesis, but this does not indicate “no publication bias.” Instead, it indicates that Begg’s test failed to detect evidence of publication bias. (Remember, kids: absence of evidence is not evidence of absence!) Begg’s test has low statistical power to detect publication bias (Sterne et al., 2000), and no single method of investigating publication bias is perfect (Chambers, 2017). To their credit, Picho-Kiroga et al. (in press) also calculated the failsafe N, but this method is even worse at detecting publication bias (Becker, 2005; see also this blog post).

The funnel plot gives a clue to one method that would likely have detected publication bias: a p-curve. This is apparent when examining the medium grey ribbon on the right side of the funnel plot, which represents the region where .01 < p < .05. Over half(!) of the studies that rejected the null hypothesis are in that region, and no other region of the funnel plot is so densely populated with effect sizes. Likely, a p-curve analysis would have shown a peak of p-values just below .05, with lower p-values being less frequent (a suspicious pattern of results). The surplus of studies supporting stereotype threat with p-values between .01 and .05 is also evidence of rampant p-hacking in the stereotype threat literature.
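
Here is a minimal sketch of what such a p-curve-style tally looks like (my own illustration with made-up p-values; a real analysis would extract the p-value from each significant study in the meta-analysis). A right-skewed curve, with many very small p-values, suggests a real effect; a pile-up just below .05 suggests p-hacking.

```python
# Tally statistically significant p-values into .01-wide bins, the basic bookkeeping
# behind a p-curve analysis.
import numpy as np

# Hypothetical p-values from the significant studies in a literature
p_values = np.array([0.048, 0.044, 0.041, 0.039, 0.036, 0.031, 0.027, 0.022,
                     0.018, 0.012, 0.009, 0.004])

bins = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05]
counts, _ = np.histogram(p_values[p_values < 0.05], bins=bins)
for lo, hi, c in zip(bins[:-1], bins[1:], counts):
    print(f"{lo:.2f}-{hi:.2f}: {c}")
# In a healthy literature the counts rise toward the smallest bin; a surplus in the
# .04-.05 bin relative to the smaller bins is the classic signature of p-hacking.
```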

I explained all of this to the journal editors in my original email, but the authors do not mention p-hacking in their revised version of the article, nor were there any changes to their analysis of publication bias. I do not know why the authors thought it was best to keep that section of the manuscript unchanged, even after the flaws had been pointed out to them.

The Man on the Wing of the Stereotype Threat Plane: Nothing to Worry About

The statistical problems in the Picho-Kiroga et al. (in press) article are fatal. But the authors also have intractable problems with the logic of how they interpret their results.

Based on Steele’s (1997) explanation of stereotype threat, Picho-Kiroga et al. (in press, p. 6) claim that three “essential” components must be present in a situation for stereotype threat to be triggered:

  1. The examinee must identify with the stereotyped domain. In the Picho-Kiroga et al. (in press) meta-analysis, this would mean that a female examinee must identify with a math or STEM domain by majoring in it, considering herself a “math person,” etc.
  2. The examinee must be aware of the negative stereotype about their demographic group. For this meta-analysis, that would mean that females are aware of stereotypes like “girls can’t do math.” It is not essential for the person to believe the stereotype, though.
  3. The task — such as a test — must be challenging to the examinee.

One of Picho-Kiroga et al.’s (in press) astute methodological decisions was to code the studies in their meta-analysis by the number of essential components each included. Their results indicate that only 7 of 102 effect sizes (6.9%) came from studies that included all three components. Most came from studies with only one (38 of 102, or 37.2%) or two (53 of 102, or 52.0%) components, and four studies (3.9%) had none of the essential components. They then searched for a relationship between the number of essential components and studies’ effect sizes.

For true believers in stereotype threat, the results are shocking. As displayed in Table 2 (below), as the number of essential components increased, the strength of the stereotype threat phenomenon decreased.

Table 2 from Picho-Kiroga et al. (in press). The highlighted portion shows that as the number of “essential” components of stereotype threat increased, the effect grew weaker. This is exactly the opposite of what stereotype threat theory would predict.

According to the table, the strongest stereotype threat effects occurred in studies that had none of the “essential” components of stereotype threat. The effect grows progressively weaker as “essential” components are added, until it statistically disappears when all three are present.

These results indicate that Steele’s (1997) “essential” components are nothing of the sort; instead, stereotype threat is most apparent in the studies that it supposedly shouldn’t show up in at all. If theories are to be judged on the basis of their ability to make empirical predictions, then there is absolutely no reason to believe that stereotype threat theory is an accurate theory of females’ cognition in non-gender typical domains.

I cannot overstate how much of a spectacular prediction failure this is for stereotype threat theory. This is as if there were a gremlin on the wing of the stereotype threat plane, and as it pulled out “essential” pieces of the engine, the plane worked better than ever.

There’s a gremlin on the wing of the plane! But if this is stereotype threat, his tinkering with the engine will only make the plane fly better — at least, according to stereotype threat advocates. There is no “Nightmare at 20,000 Feet.” Just relax! Everything is fine with the theory, er . . . plane.

How did Picho-Kiroga et al. (in press) interpret this strange pattern of effect sizes across studies? As an improper or incomplete application of stereotype threat theory distorting the results. For example, they suggest (Picho-Kiroga et al., in press, pp. 22-23) that sample heterogeneity would lower a group’s scores by including women who score lower on a math test anyway because they do not identify with a math or STEM domain.

But sampling heterogeneity cannot explain the pattern of effect sizes in Table 2. All the studies in the meta-analysis were experiments (p. 10), which presumably means there is some sort of random assignment of individuals to groups. (See Figure 1, where studies are eliminated from the meta-analysis for being non-experimental.) This is important because random assignment balances out groups; if any lower-scoring non-susceptible females were in a study, then they could not increase the stereotype threat effect size; if anything, they would decrease it. This is true whether stereotype threat is evaluated between groups, or within groups in a pre-test/post-test situation.

Even if this explanation were logical, it only addresses one of the three “essential” components. It would not explain why effects are not stronger in studies where examinees are made aware of the stereotype or where the task is challenging. In general, Picho-Kiroga et al. (in press) propose no compelling explanation for why fewer “essential” components lead to stronger stereotype threat effects.

Much Ado About Nothing

I want to give credit where it is due, though, and point out where Picho-Kiroga et al.’s (in press) interpretation is correct:

That the average stereotype threat effect for this particular category of studies was grossly inflated also suggests that most stereotype threat effects reported in the literature might be highly inflated.

Picho-Kiroga et al. (in press, p. 21)

Yes, the stereotype threat literature is full of inflated effect sizes. But these effect sizes aren’t inflated by methodological differences in the number of “essential” components of stereotype threat. Instead, they are inflated by rampant publication bias and p-hacking.

There is a clue in the literature to how inflated these results are: in a pre-registered study of stereotype threat in high-achieving female adolescents, with a sample size over three times larger than that of any study included in Picho-Kiroga et al.’s (in press) meta-analysis, Flore et al. (2018) found that the stereotype threat effect was d = .03 and not statistically different from zero. Given the field’s unfettered publication bias, rampant p-hacking, and a zero effect in the most methodologically sophisticated study on the topic, the best interpretation of the gender stereotype threat literature is that the effect is not real.

Therefore, I agree with Picho-Kiroga et al. (in press) that

Consequently, stereotype threat effects from such studies tend to be biased. Therefore, it is likely that true stereotype threat effects on women’s quantitative performance might be smaller than currently reported in the literature. . . . Thus, the common notion that stereotype threat significantly contributes to gender gaps in STEM is more than likely an overstatement.

Picho-Kiroga et al. (in press, p. 23)

However, their conclusion that “. . . stereotype threat has been demonstrably small in reliable, high-quality studies . . .” (Picho-Kiroga et al., in press, p. 24) is not supported by their own meta-analysis. Indeed, their own study contradicts it. The best-designed studies have an average effect size that is not statistically distinguishable from zero, and the literature is plagued by publication bias and p-hacking. The relationship between the number of “essential” components in a study and the apparent strength of the stereotype threat phenomenon is the exact opposite of what the theory predicts.

What Now?

On February 16, 2021, I contacted the editors of the Journal of Advanced Academics and told them that problems remained with the article. (I also spotted a new problem when examining the revised version: the effect sizes given in Table 1 do not match those shown in the funnel plot in Figure 2, likely because of some undisclosed reverse coding.) The article is in their hands.

I don’t think that Picho-Kiroga and her coauthors are going to agree with me on every point. Over half a decade of following the replication crisis has made me cynical, and my prior for the existence of stereotype threat (and other phenomena based on research with low statistical power and high publication bias, like mindset theory) is much lower than theirs. In their article, they stated,

What next for stereotype threat research? Do we throw the baby out with the bath water? That would be unnecessary. Rather, a commitment to the improvement in research methodology . . . is in order.

Picho-Kiroga et al. (in press, p. 24)

If they want to keep pursuing phantoms and Type I errors, that is fine. But if the research does improve — especially as a priori statistical power increases and authors pre-register their studies — I suggest that Picho-Kiroga and her colleagues prepare themselves to be very disappointed.

Update

On March 5, 2021, one of the editors of the Journal of Advanced Academics told me that the article would be fully published in its current (i.e., second) version. I was, however, invited to write a response that they would consider for publication. I have not yet decided whether I will.

Second Update

I did write a response to the article. On November 29, 2021, the editors published it in the Journal of Advanced Academics (Warne, in press). I am told that more corrections are coming to the Picho-Kiroga et al. (2021) article, but that there are no plans for an expression of concern or retraction.

Third Update

On January 6, 2022, the Journal of Advanced Academics published a corrigendum to the original Picho-Kiroga et al. (2021) article. It is 14 pages long and makes corrections to, literally, every table and figure in the original article. And the corrigendum still does not address the problem of using the wrong type of statistical power or of drawing illogical conclusions about the impact of adding “essential” conditions to a stereotype threat study. (Keep in mind, this corrigendum was published after a batch of stealth edits to the PDF version of the article in February 2021, before the article was published in print form.)

References

Becker, B. J. (2005). Failsafe N or file-drawer number. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 111-126). John Wiley & Sons.

Chambers, C. (2017). The seven deadly sins of psychology: A manifesto for reforming the culture of scientific practice. Princeton University Press.

Flore, P. C., Mulder, J., & Wicherts, J. M. (2018). The influence of gender stereotype threat on mathematics test scores of Dutch high school students: a registered report. Comprehensive Results in Social Psychology, 3(2), 140-174. https://doi.org/10.1080/23743603.2018.1559647

Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5(10), 3-8. https://doi.org/10.3102/0013189X005010003

Glass, G. V. (1977). Integrating findings: The meta-analysis of research. Review of Research in Education, 5(1), 351-379. https://doi.org/10.3102/0091732X005001351

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195-244. https://doi.org/10.2466/pr0.1990.66.1.195

Picho-Kiroga, K., Turnbull, A., & Rodriguez-Leahy, A. (in press). Stereotype threat and its problems: Theory misspecification in research, consequences, and remedies. Journal of Advanced Academics. https://doi.org/10.1177/1932202X20986161

Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual identity and performance. American Psychologist, 52(6), 613-629. https://doi.org/10.1037/0003-066X.52.6.613

Sterne, J. A., Gavaghan, D., & Egger, M. (2000). Publication and related bias in meta-analysis: Power of statistical tests and prevalence in literature. Journal of Clinical Epidemiology, 53(11), 1119-1129. https://doi.org/10.1016/S0895-4356(00)00242-0

Warne, R. T. (in press). No strong evidence of stereotype threat in females: A reassessment of the Picho-Kiroga et al. (2021) meta-analysis. Journal of Advanced Academics. https://doi.org/10.1177/1932202X211061517

Yuan, K.-H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 30(2), 141-167. https://doi.org/10.3102/10769986030002141
