Last year, I co-authored an article with my student where we identified the first known publication of the subtests that appear on the Stanford-Binet 5, the WPPSI-IV, WISC-V, and WAIS-IV (Gibbons & Warne, 2019). Much to our surprise, we found that the majority of subtest formats on these popular intelligence tests were created by 1908.

First publication of subtest formats that appear on the Stanford-Binet 5, WAIS-IV, WISC-V, and WPPSI-IV. The median “premiere” date for these subtests is 1908. Source: Gibbons & Warne, 2019, Figure 2.

There are some advantages to this stability. Research on the Flynn Effect is much easier when subtests are stable, and cognitive development is much easier to track. Continuity also has the advantage that these subtests have decades (sometimes more than a century) of validity evidence supporting their use.

On the other hand, the psychometric conservatism of test developers has a downside. Innovative item formats are mostly ignored by the revisers of the most popular tests, and this resistance means that technological innovations are unlikely to be adopted into subtest design. (When the test creators do adopt technology, it is mostly to have a high-tech way to administer old subtests.)

That Which Is Seen, and That Which Is Not Seen in Test Creation

The economist Frédéric Bastiat encouraged his readers to think of the unseen effects of economic decisions. For example, when considering the costs and benefits of a government policy, it is important to assess that which is seen (its price tag and the tangible consequences) and that which is not seen (the benefits that would accrue from using the money for other purposes). Applied to testing, I think people who create and perform research on intelligence tests should sometimes think about the tasks and subtests we don’t see on intelligence tests, instead of focusing on the ones we do.

This occurred to me when I read Guy Whipple’s (1919) book, Classes for Gifted Children: An Experimental Study of Methods of Selection and Instruction. This is a report of a study of how tests differentiate between gifted and non-gifted children. Whipple and his team gave dozens of tests to two classes of gifted children and two classes of non-gifted children. Much to my surprise, the best tests at distinguishing between the two were a mix of tests that are familiar and unfamiliar to modern scholars.

One example of a familiar test was the “Thurstone Punched Holes” test, which Whipple recommends. It is a variation of Binet and Simon’s (1905) paper cutting test, and a similar test appears on the modern version of the Cognitive Abilities Test (CogAT). Another was the “Bosner Reasoning II, V, and VI” tests, which require examinees to complete a sentence so that it is correct (III), explain why a fact about the world is true (V), and define words (VI). All of these are common verbal reasoning tasks on intelligence, aptitude, and cognitive tests.

But other tests were completely unfamiliar to me as a 21st-century researcher. One was the Steacy Drawing test, which required examinees to reproduce a geometric figure on a grid of lines. According to Steacy (1919), the test correlated well with tests of spatial ability, though not enough details are reported for me to determine whether the test is a good measure of general intelligence. No one besides Steacy and Whipple has investigated this test, and Steacy’s book is not cited in the scholarly literature after 1928.

Some of the geometric patterns that examinees were asked to reproduce in the Steacy Drawing Test. Examinees were given a small grid of squares to help guide their drawing of each diagram. Source: Steacy, 1919, p. 12.

More intriguing is the “Equivalent Proverbs” test, in which examinees were shown proverbs from different cultures worldwide and had to discern which ones had the same message. Based on the brief description that Whipple (1919, pp. 37-38) provided, it probably required abstract verbal reasoning and the ability to draw inferences. Whipple stated that the test resembled the “interpretation of fables” test, which probably refers to the subtest by that name that appeared on the original version of the Stanford-Binet (Terman, 1916, pp. 290-301). Indeed, some acceptable correct answers to the interpretation of fables subtest were proverbs, such as “Birds of a feather flock together,” and “Don’t count your chickens before they hatch.”

Scholarly work from a century ago is replete with tests that showed promise as measures of intelligence but are not in use today. Some of these were empirically investigated and found to be insufficient (like the Porteus Maze Test), but others seem to have quietly died out without being investigated. Maybe the Steacy Drawing test and the Equivalent Proverbs test were excellent measures of intelligence, or maybe not. No one knows because a thorough investigation of these tests’ properties and performance never occurred.

While subtests on modern tests do a very good job of measuring intelligence, there is no guarantee that they are the best tasks possible. These subtests are not the product of a competitive Darwinian process where the candidates were all pitted against each other and the best survived to appear on modern intelligence tests. If the Steacy Drawing test and the Equivalent Proverbs test are typical, then some subtests “died out” for reasons that have nothing to do with their quality. Like many graduate students, Steacy seems to have earned his doctorate and then disappeared completely from the scholarly record. Whipple never published the Equivalent Proverbs test.

So, while I find “what is seen” on intelligence tests to be useful, I find myself asking about “that which is not seen.” Would reviving old tasks to measure intelligence help us measure the trait better? Empirical investigation is the only way to know for sure.


