[NOTE: After publishing this post, I was informed (first by Avraham Eisenberg and later by Emil Kirkegaard) that some of the following information was incorrect. For the sake of transparency, I am leaving the incorrect text posted but crossed out so that people can see my original error and understand that I no longer stand by those claims.]

Charles Murray has a new book that was released last week, Facing Reality: Two Truths About Race in America. In it, he discusses two facts about average differences in behavior across the largest American racial groups: intelligence and criminality. Murray (2021) does not state anything about these differences that is new to experts — or to any informed person not blinded by wokeness. His contribution is to say that refusal to talk about race differences is itself harmful to the American experiment and, in the long run, makes it harder to solve important societal problems.

Because some of the book discusses intelligence differences, people are out in full force trying to undermine Murray’s conclusions that are based on IQ data. Most of the arguments against intelligence tests, group comparisons, or Murray’s interpretations are old chestnuts that I discuss in my book In the Know: Debunking 35 Myths About Human Intelligence (Warne, 2020).

However, there is one argument that I saw on social media during the weekend that is absent from that book: the claim that test developers design tests to force males and females to have equal average IQ scores.

The implication is that if test creators can force an average difference to disappear for one pair of groups (i.e., males and females), then any differences or lack of differences are also engineered into the tests. If this were correct, it would undermine Murray’s discussion of average IQ differences between races because those differences would be nothing more than an artifact of test creation.

There is no evidence whatsoever to indicate that test creators are purposefully creating or eliminating differences in average IQ scores for any demographic groups. I have never found any such procedure mentioned or used in the test manuals or technical documentation for any intelligence, cognitive, or academic test.

The closest thing I can find to support this assertion is that many test battery creators like to balance the number of verbal and non-verbal tasks on a test, a practice that dates back to David Wechsler’s original test for adults (the Wechsler-Bellevue) in 1939. Because, generally, females tend to excel on verbal tasks and males tend to excel on non-verbal tasks, some people take this to indicate that Wechsler was trying to equalize average IQs for the two sexes. However, this is not true for two reasons. First, Wechsler balanced verbal and non-verbal tasks because his experience led him to believe that the highly verbal Stanford-Binet intelligence test was not measuring the full breadth of intelligence because it had so few non-verbal tasks (Boake, 2002).

Second, the idea that balancing out verbal and non-verbal tasks is a covert way to eliminate sex differences in average scores is overly simplistic. There are too many exceptions to the rule linking verbal/non-verbal performance to sex for it to be a useful test development guide. For example, men do better on verbal general knowledge tasks, while women tend to do better at non-verbal coding tasks.

Making a standardized test is more difficult than it appears. Well-meaning suggestions can undercut the test’s purpose.

Engineering Tests to Eliminate Race Differences

Ironically, while no one seems to have ever tried to eliminate sex differences in average scores, people have tried to use test development to eliminate race differences in average IQ scores. The results are disappointing.

In the early 1980s, the Educational Testing Service was sued because a test they had created for employment screening at an insurance company had average differences in scores, with White Americans scoring higher than African Americans. When the test was used in hiring, it resulted in a disproportionate number of African Americans being rejected for jobs (Golden Rule Life Insurance Company v. Washburn, 1984).

An out-of-court settlement in 1984 led to what psychometricians call the “Golden Rule procedure,” which is to build a test using items that have the smallest between-group race differences in passing rates (Linn & Drasgow, 1987). This procedure forces tests to have minimal or zero differences in average scores across racial groups.
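To make the selection step concrete, here is a minimal sketch of the idea as described above: rank candidate items by the absolute gap in passing rates between two groups and keep those with the smallest gaps. This is an illustration only, not the settlement's actual rules. The `Item` class, the function name, and the pass rates are all hypothetical and invented for the example.

```python
from typing import NamedTuple


class Item(NamedTuple):
    name: str
    pass_rate_a: float  # proportion of group A answering correctly (made up)
    pass_rate_b: float  # proportion of group B answering correctly (made up)


def select_smallest_gap_items(pool: list[Item], n_items: int) -> list[Item]:
    """Return the n_items items with the smallest absolute pass-rate gap."""
    ranked = sorted(pool, key=lambda item: abs(item.pass_rate_a - item.pass_rate_b))
    return ranked[:n_items]


# Hypothetical item pool; these numbers do not come from any real test.
pool = [
    Item("item_1", 0.80, 0.78),
    Item("item_2", 0.65, 0.50),
    Item("item_3", 0.55, 0.52),
    Item("item_4", 0.70, 0.45),
    Item("item_5", 0.90, 0.89),
]

selected = select_smallest_gap_items(pool, n_items=3)
selected_names = [item.name for item in selected]
print(selected_names)  # ['item_5', 'item_1', 'item_3']
```

Note that the ranking uses only the between-group gap. Nothing in it asks how well an item relates to job performance or any other criterion, so an item can be dropped regardless of how informative it is.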

The Golden Rule procedure sounds like a good idea on paper, but its application has not fulfilled expectations (e.g., Robertson et al., 1977). When tests are created with the Golden Rule procedure, the overall test score’s reliability is reduced, and the test fails to correlate with important criteria (e.g., job performance, educational outcomes). In other words, the items with the largest race differences in passing rates also tend to be the items that make the test a useful predictor of behavior in the real world (Linn & Drasgow, 1987; Phillips, 2000). Later attempts to legally impose the Golden Rule procedure on test development have failed, mostly because the procedure would eliminate the test’s usefulness as a tool for decision making (Phillips & Camara, 2006).

Supporting this line of evidence is the research on Spearman’s hypothesis (e.g., Jensen, 1980, 1985; te Nijenhuis & van den Hoek, 2016; Warne, 2016). This evidence consistently shows that the tests and subtests that show the highest average difference between White Americans’ and African Americans’ scores are also the most effective tools for measuring general intelligence (i.e., g). This fact has several important implications (Jensen, 1998; Warne, 2020), one of which is that average IQ differences across these two groups are not an artifact of test creation. Eliminating those differences is possible, but the result will be a test that is no longer an intelligence test. However, there is absolutely no evidence that a test would lose any of its validity or predictive power if someone were to eliminate sex differences (though no test creators seem to engage in this practice anyway).

Postscript

[NOTE: This postscript was added to the post after the inaccuracies of the original version were made clear.]

The first person to draw my attention to my incorrect understanding was Avraham Eisenberg, who in two tweets (linked above) pointed out that Jensen (1998, pp. 532-533) reported that items with large differences in passing rates for males and females have been systematically removed from the Stanford-Binet and Wechsler tests. I verified this information in the 1937 Stanford-Binet manual:

. . . it was then possible to plot for each test the curve showing per cent of subjects passing in successive ages throughout the range, also the curve of per cent passing by successive intervals of composite total score on the two forms. This was done for the sexes separately as a basis for eliminating tests which were relatively less “fair” to one sex than the other.

Terman & Merrill, 1937, p. 22

And later in the same volume:

A few tests in the trial batteries which yielded largest sex differences were early eliminated as probably unfair. A considerable number of those retained show statistically significant differences in the percentages of success for boys and girls, but as the scales are constructed these differences largely cancel out.

Terman & Merrill, 1937, p. 34

[NOTE: As I was writing this postscript, Emil Kirkegaard updated his post to reflect more detailed information about this procedure in the 1937 Stanford-Binet.]

Emil Kirkegaard’s response to the original version of this post is worth reading. It includes a quote from Wechsler stating that he engaged in the practice of eliminating items that showed large sex differences in performance. However, I have not been able to find any reference to this in the test manuals for the current editions of the Stanford-Binet (Roid, 2003) or the Wechsler tests (Wechsler, 2008, 2012, 2014). If test creators are still eliminating items that show large average sex differences, they are not documenting this information in their technical manuals anymore.

Even if sex differences in IQ are minimized in the test development process, doing so does not have the massively detrimental effects on reliability and validity that minimizing race differences in IQ does. An intelligence test is still an intelligence test when sex differences in scores are minimized (though this may hide a difference favoring one sex or the other). The same cannot be said for minimizing race differences in IQ scores. This is strong evidence that the race difference in IQ is driven by a meaningful difference in intelligence and that Murray’s (2021) conclusions about the consequences of mean differences in IQ are viable.

Finally, I want to state that I am grateful that readers were so responsive in giving me feedback. I strive for accuracy, and I appreciate the correction.

References

Boake, C. (2002). From the Binet-Simon to the Wechsler-Bellevue: Tracing the history of intelligence testing. Journal of Clinical and Experimental Neuropsychology, 24(3), 383-405. https://doi.org/10.1076/jcen.24.3.383.981

Golden Rule Life Insurance Company v. Washburn, Settlement Agreement & General Release, No. 419-76 (Ill. 7th Jud. Cir. Ct. Sangamon County, November 20, 1984).

Jensen, A. R. (1980). Précis of bias in mental testing. Behavioral and Brain Sciences, 3(3), 325-333. https://doi.org/10.1017/S0140525X00005161

Jensen, A. R. (1985). The nature of the black–white difference on various psychometric tests: Spearman’s hypothesis. Behavioral and Brain Sciences, 8(2), 193-219. https://doi.org/10.1017/S0140525X00020392

Jensen, A. R. (1998). The g factor: The science of mental ability. Praeger.

Linn, R. L., & Drasgow, F. (1987). Implications of the Golden Rule settlement for test construction. Educational Measurement: Issues and Practice, 6(2), 13-17. https://doi.org/10.1111/j.1745-3992.1987.tb00405.x

Murray, C. (2021). Facing reality: Two truths about race in America. Encounter Books.

Phillips, S. E. (2000). GI Forum v. Texas Education Agency: Psychometric evidence. Applied Measurement in Education, 13(4), 343-385. https://doi.org/10.1207/s15324818ame1304_04

Phillips, S. E., & Camara, W. J. (2006). Legal and ethical issues. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 733-755). Praeger Publishers.

Robertson, D. W., Royle, M. H., & Morena, D. J. (1977). Comparative racial analysis of enlisted advancement exams: Item differentiation. National Technical Information Service. Technical Report Accession No. ADA035672. https://apps.dtic.mil/sti/citations/ADA035672

Roid, G. H. (2003). Stanford-Binet intelligence scales, fifth edition, technical manual. Riverside Publishing.

te Nijenhuis, J., & van den Hoek, M. (2016). Spearman’s hypothesis tested on black adults: A meta-analysis. Journal of Intelligence, 4(2), Article 6. https://doi.org/10.3390/jintelligence4020006

Terman, L. M., & Merrill, M. A. (1937). Measuring intelligence: A guide to the administration of the new revised Stanford-Binet tests of intelligence. Houghton Mifflin Company.

Warne, R. T. (2016). Testing Spearman’s hypothesis with Advanced Placement examination data. Intelligence, 57, 87-95. https://doi.org/10.1016/j.intell.2016.05.002

Warne, R. T. (2020). In the know: Debunking 35 myths about human intelligence. Cambridge University Press. https://doi.org/10.1017/9781108593298

Wechsler, D. (2008). WAIS-IV technical and interpretive manual. Pearson.

Wechsler, D. (2012). WPPSI-IV technical and interpretive manual. Pearson.

Wechsler, D. (2014). WISC-V technical and interpretive manual. Pearson.
