Sometimes the fish does not realize that it’s wet.
That was my realization this week after I posted a brief description of a study that was recently published.
The study (Schneider et al., 2020) showed that a short training video led a randomly selected experimental group to have IQ scores over 15 points higher than a control group that watched an informational video that did not teach techniques for solving items on an intelligence test. The test was the DESIGMA-Advanced test, which consists of matrix items (like the Raven’s Progressive Matrices), a common nonverbal test format given to measure intelligence.
For someone swimming in the waters of intelligence research, this article was not surprising. Psychologists have known for decades that matrix items are governed by a small number of rules (Carpenter et al., 1990). This makes the tests very susceptible to score increases from practice effects and coaching. Schneider et al.’s (2020) contribution is to show how efficient and effective this coaching can be.
A Seductive (and Simplistic) Interpretation
What I did not anticipate was how people outside of my pond of scientists studying intelligence would (mis)interpret the study. One common reaction was that this was proof that intelligence tests don’t really measure intelligence, or that there was something wrong with the tests.
The logic makes sense on the surface: if intelligence is really a stable characteristic of a person, then it shouldn’t be so easy to manipulate IQ scores. It seems that there could be something very wrong with the tests. The reasoning is very seductive, but it’s overly simplistic.
Echoing James Flynn
The critics of intelligence tests are rehashing an old argument from decades ago. James Flynn (1987) reached the same conclusion when he noticed that average scores on intelligence tests had inflated over time (a phenomenon now called the Flynn effect). After observing that the radical rise in scores on the Raven’s (which, like the test in the Schneider et al., 2020, study, also consists of matrix items) was not accompanied by a similar increase in intellectual accomplishment, he stated:
The Ravens Progressive Matrices Test does not measure intelligence but rather a correlate with a weak causal link to intelligence; the same may apply to all IQ tests.

Flynn (1987, p. 187)
. . . psychologists should stop saying that IQ tests measure intelligence. They should say that IQ tests measure abstract problem-solving ability (APSA), a term that accurately conveys our ignorance.

Flynn (1987, p. 188)
Because this is not a new argument, scientists studying intelligence have had a generation to investigate the phenomenon of increasing Raven’s scores and understand it better. The paradox of increasing IQ scores on the Raven’s (a test that is supposed to measure intelligence) without a commensurate increase in intelligence has been solved for years. This seems to be another example of critics of intelligence research parroting old arguments without being aware that the real conversation has advanced.
Not only does the flawed explanation of radical increases in Raven’s scores originate with Flynn, but the Schneider et al. (2020) study is best understood in the context of the Flynn effect. When Flynn showed how regular IQ score increases are, one of the surprising results was that the IQ increases were strongest on the tests that contained the least scholastic content, such as matrix items. Indeed, one can think of Schneider et al.’s (2020) study as a miniature Flynn effect study where one cohort takes the test under standard (i.e., uncoached) circumstances, while the other benefits from an environment that encourages solving abstract problems. The Schneider et al. (2020) study just makes this process much more efficient by making the coaching explicit and doing it in less than 15 minutes (instead of 30 years).
Matrix items are susceptible to score increases from the Flynn effect or from coaching because they are based on a small number of rules. Mastering just a few principles behind the construction of matrix items equips an examinee with the skills to improve their performance on every item (Armstrong & Woodley, 2014; Fox & Mitchum, 2013). This is not true of some other types of tests, such as vocabulary or general knowledge tests. This is why verbal tests showed a weaker Flynn effect (Flynn, 1987) and why there have been no increases in verbal ability in industrialized countries for decades (Pietschnig & Voracek, 2015).
Another fact that helps in understanding Schneider et al.’s (2020) results is that intelligence is not synonymous with IQ, a point I have made before on this blog. I also make the distinction in my new book, In the Know: Debunking 35 Myths About Human Intelligence:
IQ, or an IQ score, is not the same as intelligence or g. Instead, IQ is a measure of general intelligence. To use an analogy, just as kilograms and pounds are measures of weight, IQ is a measure of intelligence. IQ is not intelligence itself any more than the number on a scale is a person’s weight.

Warne (2020, p. 12)
IQ is a number that is the result of a mix of influences from intelligence and other sources. (This is true of all psychological tests: they measure a mix of the trait under investigation and other influences.) In Schneider et al.’s (2020) study, the training video is an example of inflating IQ by increasing the non-intelligence influences on the score. This interpretation is supported by the fact that both the experimental group and the control group had almost the exact same correlation between their DESIGMA-Advanced and the scores on an intelligence test that they were not prepped for (r = .53 to .58). Thus, Schneider et al. (2020) were very successful at raising IQ, but the training did not make examinees smarter.
Failing to raise intelligence after training doesn’t just happen in a test preparation context. “Brain training” regimens teach users to perform very well on a few tasks, but this training does not transfer to high performance on dissimilar tasks or raise overall intelligence (Sala & Gobet, 2019; Simons et al., 2016; Stojanoski et al., 2018). In fact, the brain training literature shows the importance of having a wide variety of tasks on an intelligence test: individual tasks can be trained to inflate IQ, but when there are many tasks (some resistant to training) the overall inflation is very small. The Schneider et al. (2020) study shows one of the perils of tests (like the DESIGMA-Advanced or Raven’s) that consist of only a single item type. It does not invalidate all intelligence tests, especially tests (like a Stanford-Binet or Wechsler test) that consist of many different types of tasks.
The Flynn effect (which must be a purely environmental effect) is another example of higher IQ that does not correspond to higher intelligence. Research shows that the increase in IQ from the Flynn effect comes from the non-g portion of IQ scores and that the g/intelligence contribution has not increased (Lubke et al., 2003; Wicherts et al., 2004). Together, these forms of evidence overwhelmingly indicate that IQ and intelligence can be disconnected from one another.
Valid Concerns about Validity
The Schneider et al. (2020) study, the Flynn effect data, and brain training games don’t mean that the Raven’s, DESIGMA-Advanced, and other tests get a free pass. Because IQ can be inflated through coaching and other methods, there are still reasons to be concerned with how tests are used. These concerns are related to validity, which is the degree to which score uses and interpretations are supported by evidence.
Schneider et al.’s (2020) study (and the Flynn effect data) showed that if everyone in a sample gets training to improve IQ scores, then those scores have approximately equal correlations with other measures of intelligence. However, the study also shows that giving some people training and not others weakens the correlation (from r = mid-.50s to the low .40s) when the two groups are combined (Schneider et al., 2020, Table 4). This means that differential access to training does weaken the validity of an interpretation that matrix tests measure intelligence.
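This attenuation mechanism can be made concrete with a toy simulation (my own sketch, not the authors' analysis; the coaching boost size, error variances, and sample size are arbitrary assumptions). A constant score boost given to only one group leaves each group's correlation with an un-coached criterion test essentially unchanged, but inflates the variance of the pooled matrix scores without adding covariance, so the pooled correlation drops:

```python
# Toy simulation (hypothetical numbers, not Schneider et al.'s data):
# why coaching only some examinees attenuates the pooled correlation.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000  # examinees per group

def simulate(boost):
    g = rng.normal(0, 1, n)                     # latent intelligence
    criterion = g + rng.normal(0, 0.8, n)       # un-coached intelligence test
    matrix = g + rng.normal(0, 0.8, n) + boost  # matrix test + coaching boost
    return matrix, criterion

m0, c0 = simulate(boost=0.0)  # control group: no training
m1, c1 = simulate(boost=1.5)  # trained group: constant non-g score boost

r_control = np.corrcoef(m0, c0)[0, 1]
r_trained = np.corrcoef(m1, c1)[0, 1]
r_pooled = np.corrcoef(np.concatenate([m0, m1]),
                       np.concatenate([c0, c1]))[0, 1]

# Within each group the boost is constant, so it cannot change that group's
# correlation; pooled together, it adds matrix-score variance unrelated to g.
print(round(r_control, 2), round(r_trained, 2), round(r_pooled, 2))
```

The two within-group correlations come out nearly identical, while the pooled correlation is noticeably lower, mirroring the mid-.50s-to-low-.40s pattern in the study's Table 4 (though the exact sizes here depend entirely on the assumed parameters).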
Therefore, it is not valid to say that an individual who watched the training video is necessarily smarter than a person who did not. To the extent that training differs across groups, IQ differences may not reflect intelligence differences. This is why the non-g influence of the Flynn effect makes scores from different cohorts incomparable (Fox & Mitchum, 2013; Lubke et al., 2003; Wicherts et al., 2004). However, the mere possibility of differential access to training does not mean that all score gaps between groups are caused by this. Such a conclusion would be an example of Lewontin’s bait-and-switch. If someone wants to argue that average IQ gaps do not reflect average intelligence differences, then they must present evidence that a differential gap in task training is the cause.
I think Schneider et al.’s (2020) study does raise the question of how much training people get, either through schooling or by seeking out videos on the internet that teach how to answer intelligence test questions. But their study also shows that training does not drop a matrix test’s score validity to zero. There is every reason to suspect that other tests are more resistant to IQ score inflation.
Apart from lessons about the Flynn effect, the meaning of tests, and other scientific considerations, this episode provides some other lessons. First, it was a good reminder that not everyone swims in the same waters that I do. To me, the disconnect between IQ and actual intelligence level in the Schneider et al. (2020) study was obvious. But it is not natural for people outside the field to interpret the study with that fact in mind. That is why my tweet was worded the way it was, and why some people mistakenly understood it as a damning indictment of intelligence tests.
Second, always read James Flynn. Decades ago, he already either made or analyzed almost all of the criticisms that outsiders have about intelligence researchers. The field responded, and now the science is stronger than ever. Repeating the same criticisms from the 1980s is not an effective critique of intelligence tests or intelligence research. Instead, building on Flynn’s work (and the responses to him) is much more productive.
Finally, there is the triviality that nearly got lost in all of the hubbub on social media:
This old wisdom is a good reminder of the obvious: tasks don’t seem challenging when people are given answers or are told how to find them. This fact is almost trivial, but it is important to keep in mind when reading studies on test preparation. Coaching may change the very nature of the test; that tells us as much (or more) about the coaching procedure as about the test itself.
Armstrong, E. L., & Woodley, M. A. (2014). The rule-dependence model explains the commonalities between the Flynn effect and IQ gains via retesting. Learning and Individual Differences, 29, 41-49. https://doi.org/10.1016/j.lindif.2013.10.009
Carpenter, P. A., Just, M. A., & Shell, P. (1990). What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices Test. Psychological Review, 97(3), 404-431. https://doi.org/10.1037/0033-295X.97.3.404
Flynn, J. R. (1987). Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101(3), 171-191. https://doi.org/10.1037/h0090408
Fox, M. C., & Mitchum, A. L. (2013). A knowledge-based theory of rising scores on “culture-free” tests. Journal of Experimental Psychology: General, 142(3), 979-1000. https://doi.org/10.1037/a0030155
Lubke, G. H., Dolan, C. V., Kelderman, H., & Mellenbergh, G. J. (2003). On the relationship between sources of within- and between-group differences and measurement invariance in the common factor model. Intelligence, 31(6), 543-566. https://doi.org/10.1016/S0160-2896(03)00051-5
Pietschnig, J., & Voracek, M. (2015). One century of global IQ gains: A formal meta-analysis of the Flynn effect (1909-2013). Perspectives on Psychological Science, 10(3), 282-306. https://doi.org/10.1177/1745691615577701
Schneider, B., Becker, N., Krieger, F., Spinath, F. M., & Sparfeldt, J. R. (2020). Teaching the underlying rules of figural matrices in a short video increases test scores. Intelligence, 82, Article 101473. https://doi.org/10.1016/j.intell.2020.101473
Sala, G., & Gobet, F. (2019). Cognitive training does not enhance general cognition. Trends in Cognitive Sciences, 23(1), 9-20. https://doi.org/10.1016/j.tics.2018.10.004
Simons, D. J., Boot, W. R., Charness, N., Gathercole, S. E., Chabris, C. F., Hambrick, D. Z., & Stine-Morrow, E. A. L. (2016). Do “brain-training” programs work? Psychological Science in the Public Interest, 17(3), 103-186. https://doi.org/10.1177/1529100616661983
Stojanoski, B., Lyons, K. M., Pearce, A. A. A., & Owen, A. M. (2018). Targeted training: Converging evidence against the transferable benefits of online brain training on cognitive function. Neuropsychologia, 117, 541-550. https://doi.org/10.1016/j.neuropsychologia.2018.07.013
Warne, R. T. (2020). In the know: Debunking 35 myths about human intelligence. Cambridge University Press. https://doi.org/10.1017/9781108593298
Wicherts, J. M., Dolan, C. V., Hessen, D. J., Oosterveld, P., van Baal, G. C. M., Boomsma, D. I., & Span, M. M. (2004). Are intelligence tests measurement invariant over time? Investigating the nature of the Flynn effect. Intelligence, 32(5), 509-537. https://doi.org/10.1016/j.intell.2004.07.002