In my previous blog post, I wrote about 10 fraudulent scholarly articles and book chapters by Stephen E. Breuning that had still not been retracted more than 30 years after the fraud was exposed. I have contacted the editors of the journals where the articles were published and asked them to retract the articles. Hopefully, they will do so quickly so that the scientific record can be cleaned up.
But this is not the entire story. When the National Institute of Mental Health (NIMH) investigated Breuning’s work, they only investigated the research that was funded by NIMH. This is an understandable choice, because NIMH was interested in determining whether their research funds had been misused. However, from the perspective of ensuring that the scholarly record is accurate, it is inadequate. Unless one believes that Breuning decided to commit fraud only in his NIMH-funded work, it is necessary to investigate Breuning’s other work. For this reason, I have investigated Breuning’s articles in my areas of expertise (intelligence research and educational psychology) that the NIMH investigation did not look into. What I have found makes it likely that Breuning’s fraud started while he was a graduate student.
Four Likely Fraudulent Studies
I have identified four studies published by Breuning between 1978 and 1981 which have characteristics that make it likely that the studies are fraudulent. Some of these characteristics are found in many fraudulent studies, while other characteristics are suspicious because they also appear in Breuning’s 1980s work.
I am focusing on these articles because they are in my areas of expertise (educational psychology and intelligence research), and I am qualified to evaluate them in a detailed manner. All four are cited by later researchers in a positive manner, usually with no hint that the findings may be questionable. I present them below in the approximate order of their publication.
Breuning (1978a)
This article reports two experiments. In Experiment 1, the sample consisted of 328 high school students from 12 classes taught by 6 teachers (2 classes per teacher) in 5 suburban Chicago schools. In the first phase of the study, all classes were taught with the traditional teaching procedure. After 12 weeks, half of the classes (one per teacher) were switched to “precision teaching.” In the “precision teaching” condition, students were given (at least one week in advance) three assignments per week, due Monday, Tuesday, and Wednesday of each week. On those class days, students would have a normal class experience for 30 minutes and then take a quiz. Students who scored 90% or above on all three quizzes “were allowed to do as they pleased, within practical limits” (Breuning, 1978a, p. 128) on Thursday and Friday during class time. Students who did not pass the quizzes had review days on Thursday or Friday. This phase lasted another 12 weeks. Students in the traditional teaching and precision teaching groups took tests every 3 weeks of both phases and a retention test approximately five months after Phase II was completed. Students in the precision teaching group in Phase II performed better on the regular tests (two-thirds scoring above 90%) and the retention test (approximately 15 percentage points higher) than students who experienced the traditional teaching condition.
Experiment 2 occurred after the end of the second 12-week portion of the first experiment. Breuning (1978a) supposedly identified 39 students who did not respond to precision teaching. Most of these — 24 of the 39 — stated that they were not completing their assignments because of their after-school employment. For 21 of these students, a program was implemented where the student could not work a weekend shift at their job until they had completed their Monday assignment. On weekdays, students could not work until their work for the next school day was complete. Students who performed well on their quizzes were allowed to work extra hours. The students’ average grade on the tests increased from 47.5% to 83.7% — going from an F to a B, on average. No student answered fewer than 76% of questions correctly on the retention test.
Based on the description in the Breuning (1978a) article, I am extremely doubtful that this study occurred. Here are the reasons for my skepticism:
- Even though each teacher taught one class with precision teaching and one class with the traditional method during Phase II of Experiment 1, there was no treatment diffusion. In other words, none of the teachers started using the precision teaching techniques in their regular classes, even though the differences in class performance were large and noticeable. In fact, by the fifth week of Phase II, the precision teaching classes had to receive extra material (Breuning, 1978a, p. 127), yet the much higher performance in the experimental classes did not spur any teacher to adjust how she or he taught their control group class. This is unusual when a highly beneficial intervention is easily available to the control group.
- The details of the administration of the retention test are contradictory. On the one hand, teachers were informed of the test only two days before it was administered (Breuning, 1978a, p. 130), indicating that the test was probably given at the end of the school year (five months after the end of Phase II of Experiment 1). On the other hand, only 55.8% of students took the retention test because many students had already gone to college or moved away (p. 130). This indicates that the retention test may have been given in the next school year. This is an irreconcilable contradiction. If the test was given in the same school year, a lot more students would have taken it; if it was given the next school year, then it would not be necessary to inform most of the teachers in the study because their students would have advanced to the next class or grade.
- Breuning (1978a, p. 130) found five volunteer teachers to examine every test question that was in common across groups and rate whether the grading was consistent. Conservatively estimating that each test had 30 questions in common across groups, and with a test every 3 weeks for 24 weeks (8 tests), this is 240 test questions for each of 328 students that these volunteers checked for consistent grading. On average, each volunteer teacher would be checking 15,744 responses for consistency in grading. Despite their herculean efforts, none of these volunteers are thanked in the acknowledgements.
- The results are extremely “clean” and show that the impact of precision teaching is consistent across classroom subjects. Moreover, the overall results and the results for each subject show the same interaction where precision teaching raised the test scores of the top two-thirds of students but not the bottom third of students (Breuning, 1978a, p. 131). Getting consistent results in subgroups and overall results is odd enough; replicating an interaction in 5 different subgroups is even more unusual, especially when the interaction is called “unexpected” (p. 131) and the subgroups would have much lower statistical power for detecting any interaction.
- The employers of the students in Experiment 2 are unbelievably accommodating. In addition to checking on the students’ homework at the beginning of every shift, they also met with the student’s teacher every Friday (Breuning, 1978a, p. 135). Additionally, employers all agreed to allow a student to start a shift late if they had not finished their homework and to receive extra hours if they performed well on quizzes (p. 135)! It is amazing that employers’ staffing needs were so flexible and both employers and teachers had the time to meet every week.
- Experiment 1 took 24 weeks to conduct (not including the retention test), and Experiment 2 took another 12 weeks to conduct. Assuming that vacation weeks (e.g., spring break, Christmas break) are not counted in these totals, then Experiment 2 would have to begin almost immediately after Phase II of Experiment 1 ended. This is because 36 full weeks of school is 180 days, which is the length of a typical school year. Given the preparation to identify students and prepare Experiment 2 (e.g., designing the intervention, getting permission from employers, obtaining informed consent from parents), it is highly unlikely that both studies reported in Breuning (1978a) could be completed before the school year ended. This isn’t even taking into account the fact that most education studies do not start on the first day of school and do not end on the last day to avoid activities that would disrupt the study (e.g., students transferring out of classes, mandated final examinations, end-of-school activities and trips).
- Breuning (1978a, pp. 138-139) claims that the study was replicated, and he cited two manuscripts that he supposedly coauthored with one of his professors, Dr. Paul Koutnik. However, when interviewed by the NIMH investigatory panel, “. . . Dr. Koutnik said he knew nothing of any research associated with Dr. Breuning’s involvement with the schools . . .” (p. 42; see also pp. 238-239). These supposed replications were never published.
- The Breuning (1978a) study raised suspicions at the time. Brophy (1978) found it odd that Breuning did not find an ability-treatment interaction, where high-achieving students benefit more from self-directed teaching methods than average students. Such interactions were well known by the 1970s (Cronbach, 1975) and are one of the most consistent interactions in educational psychology. Brophy also found it odd that so many students (21 of 24, or 87.5%) who stated that their low grades were due to their work commitments “readily volunteered” (Breuning, 1978a, p. 139) for Experiment 2. Finally, Brophy (1978, p. 142) thought both of Breuning’s experiments were “expensive in time, if not money,” which raises the question of how a graduate student could have run such a study alone while also working on animal research. Breuning (1978b) wrote a response to Brophy’s (1978) commentary, but these issues are either ignored or incompletely addressed.
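The workload and calendar arithmetic in the list above can be reproduced in a few lines of Python. This is only a rough check: the figure of 8 tests is an assumption inferred from "tests every 3 weeks" across two 12-week phases, the 30 common questions per test is the conservative estimate used above, and the variable names are purely illustrative.

```python
# Grading-check workload for Experiment 1, using the figures quoted above.
tests_per_student = 24 // 3   # assumption: one test every 3 weeks for 24 weeks
questions_per_test = 30       # conservative estimate used in the post
students = 328
volunteers = 5

questions_per_student = tests_per_student * questions_per_test
total_responses = questions_per_student * students
responses_per_volunteer = total_responses / volunteers

# School-calendar check: Experiment 1 (24 weeks) plus Experiment 2 (12 weeks),
# at 5 school days per week.
school_days = (24 + 12) * 5

print(questions_per_student)    # 240
print(total_responses)          # 78720
print(responses_per_volunteer)  # 15744.0
print(school_days)              # 180
```

Changing `questions_per_test` shows how quickly the volunteers' workload grows under less conservative estimates.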
There are other aspects of the study that normally wouldn’t make me very suspicious. However, in an article that already is likely fraudulent, these characteristics raise further questions:
- Breuning (1978a) never stated how these teachers were trained or who trained them in precision teaching. Given that Breuning’s only K-12 teaching experience was as a student teacher in a high school biology class (NIMH investigatory panel, p. 24), he was not qualified to give any teacher training.
- The precision teaching portion of the study is planned for a 5-day school week. It is never explained what the procedure is when students have one or more days off during the week.
Breuning & Zella (1978)
Breuning and Zella (1978) reported a study of three samples (total n = 485) of high school students enrolled in 19 special education classes in the Chicago area. The students took one of three tests: the Otis-Lennon Mental Ability Test (n = 129), the Lorge-Thorndike Intelligence Test (n = 147), or the Wechsler Intelligence Scale for Children-Revised (n = 209). Students were randomly assigned to either re-take the test under standard conditions (for the control group) or with the incentive that if they performed better, they could have a reward. The incentives were student-specific and identified by interviewing people who knew the student to ask what the student wanted. Typical incentives included “record albums,” tickets to sporting events, and a portable radio.
Students in the control condition had their scores increase by 4.1 IQ points (d = .27), which is a plausible practice effect. Examinees with incentives had their scores increase by 20.9 IQ points (d = 1.39), with the increase being 19.7 to 23.8 IQ points (d = 1.31 to d = 1.43), depending on the test. This IQ increase was large enough for the weighted average IQ in the experimental group to be 102.9, while the weighted average IQ for the control group was 81.9.
The article also reports a follow-up study of 200 students who had IQ scores between 98 and 120 and 100 students with IQs between 121 and 140. These students were also “randomly assigned” to “approximately equal” sized groups. In this replication (which is reported very briefly in a single paragraph), students with more average IQs had increases of 3-7 IQ points (in the incentive group) or 1-4 IQ points (in the control group). For the high-IQ group, the score increases were just 1-5 points for both groups.
There are several characteristics in the study that make me skeptical that it actually happened:
- The NIMH investigatory panel found evidence that makes it highly unlikely that Breuning conducted any large-scale research in Chicago area schools. The panel found that Breuning’s primary experience in K-12 schools during his graduate education was working as a student teacher in a high school sophomore biology class for a year (p. 24), which was required for Breuning to obtain a teaching certificate (p. 42). Breuning also conducted workshops on classroom behavior management (p. 24) and did “contractual work in the school system, measuring the positive reinforcement effect of rewards on school performance of special education students” (p. 238). (Note that this statement does not indicate that Breuning conducted any research or that IQ scores were an area of concern for this work.) Neither dissertation committee member who Breuning claimed could vouch for his Chicago-era research knew anything about any studies he was conducting on human subjects (p. 24). In fact, according to the dissertation chair, “Dr. Breuning was expected to devote all of his research time to his dissertation research” (p. 24), which was on goldfish. The educational psychologist on Breuning’s committee, Dr. Paul Koutnik, told the NIMH investigatory panel that he “. . . knew nothing of any research associated with Dr. Breuning’s involvement with the schools . . .” (p. 42; see also pp. 238-239). Given the massive scope of the study reported in the Breuning and Zella (1978) article, it seems extremely unlikely that it could have been conducted by a graduate student without the lead author’s professors knowing.
- The time window for this study to occur is extremely narrow. Breuning worked at South Suburban Chicago Schools Project from March 1976 through December 1977 (see p. 152 of the NIMH panel’s report). The manuscript for the article was received by the journal on December 9, 1976. This means that the study would have needed to occur between March 1976 and the first week of December 1976. Breuning would have about 9 months to plan the study and the follow-up study of 200 children who were not in special education, execute both studies, analyze the data, write the manuscript, and submit it to the journal. Given the magnitude of the study — conducting and/or supervising interviews, coordinating administration and scoring of intelligence tests, supervising the reliability check, finding volunteers to help with the study, etc. — it is extremely unlikely that Breuning could have done all of this while also working on his research on animal learning. There simply isn’t enough time to do this study on IQ while also meeting the research demands of his graduate program, especially if — as his dissertation committee chair stated — he was supposed to be focusing on his animal research. Reporting a study as occurring too quickly to be realistic is in character for Breuning. When Robert Sprague originally reported Breuning’s misconduct, one of the things that had made Sprague suspicious was that Breuning had reported studies as occurring too quickly to be realistic (Sprague, 1993).
- The balance of freshmen, sophomores, juniors, and seniors is “approximately equal” in each class (Breuning & Zella, 1978, p. 221). This is highly unusual in studies of high school students. Because some students drop out, in most representative high school samples, there are fewer seniors and (to a lesser extent) juniors than freshmen and sophomores.
- The study also had “approximately equal” numbers of males and females. This is a red flag because males outnumber females in special education programs and in low-IQ samples (Warne, 2020).
- To determine what incentives would be most motivating for students, a series of “. . . interviews were conducted with the student, the student’s friends, the student’s parents, and the student’s teachers” (Breuning & Zella, 1978, p. 222). Similar interviews were conducted for students in the control group, though apparently those students’ teachers were not interviewed. Therefore, at least 4 interviews per participant were conducted for children in the experimental group and at least 3 interviews per participant were conducted for every child in the control group. That means that there were at least 4(65 + 73 + 104) + 3(64 + 74 + 105) = 4(242) + 3(243) = 968 + 729 = 1,697 interviews conducted to learn what incentives would motivate each child. This is simply an extremely implausible number of “in person” (Breuning & Zella, 1978, p. 223) interviews to conduct, especially when a paper-and-pencil survey would suffice (or even a brief phone call to each interviewee).
- For a study led by a graduate student with no acknowledged funding source, this was an extremely expensive study to conduct. The article stated (Breuning & Zella, 1978, p. 223) that the cost of a child’s incentive was “typically under $25.00.” In 2022 dollars, $25 in 1976 would be worth $123.53. If a conservative estimate is that the average incentive value was $20 in 1976 ($98.82 in 2022 dollars), then there must have been a budget of 242 x $20 = $4,840 for incentives. In 2022 dollars, that translates to a budget of $23,914.87. Who picked up the tab for this? Miraculously, the parent(s) of 226 (93.4%) of the students agreed to buy the incentive(s) for their child, and the schools agreed to pay for the incentives for the other 16. Anyone who has done education research would know how absurd this claim is. Parents just don’t pay ~$100 for their child to participate in research run by a graduate student, and schools are unlikely to pick up the tab for any participant. (Indeed, standard procedure in most social science research is to pay participants for their time, not to expect them to pony up to be part of a study.)
- The generosity required for the study extends beyond financial incentives. There were a massive number of volunteers mentioned in the study, though none are thanked in the acknowledgements. Volunteers supposedly conducted some of the incentive interviews (how many is not clear) and performed reliability checks to see whether the intelligence tests were scored properly. Six school psychologists, who were otherwise uninvolved with the study, volunteered to rescore at least 42 tests each (see below). Breuning was a graduate student at the time this study supposedly occurred, and he specialized in neither educational nor school psychology. (He wrote his dissertation on animal behavior.) It is not clear how he had the time or connections to find and coordinate so many volunteers. While Breuning did do some work in suburban Chicago schools while in graduate school, the faculty member who arranged his placement in the schools “. . . knew nothing of any research by Dr. Breuning involving human subjects and that he had not been involved in arranging for Dr. Breuning to conduct such research” (NIMH investigatory panel, p. 24; see also pp. 42, 238-239).
- The test administrators were the school psychologists who administered the sample members’ original intelligence tests (Breuning & Zella, 1978, p. 222). Apparently, these school psychologists had nothing better to do during working hours than re-test dozens of students (an average of 69.3 students, assuming one school psychologist per school building) to help a graduate student with his study. Assuming about 2-3 hours per test per student, this is 3-6 weeks per psychologist of testing time, assuming that this study is the psychologist’s only task during working hours.
- Some students who took the WISC were retested with the revised version of the WISC (the WISC-R). Because of the Flynn effect, the score increases for students who took the WISC-R should have been attenuated, a fact stated in the article (Breuning & Zella, 1978, p. 221, Footnote 1). Yet, the score increases for the experimental group on the WISC/WISC-R were slightly larger than average (21.5 IQ points). Additionally, the change of test version should have greatly reduced any score gains from practice effects for the control group. Yet, that control group gained 3.9 IQ points (Breuning & Zella, 1978, p. 224).
- The effect size for the intervention is much larger than almost any other study on the impact of motivation on IQ scores. The effect size of d = 1.39 is 10 times larger than the average reported effect size in other studies (see discussion of Duckworth et al., 2011, below). Outlier effect sizes should always make people suspicious of a study.
- The follow-up study on 300 high school students with IQs of 98 or higher is odd. Why did these children already have IQ scores? Even if they already had scores (e.g., perhaps intelligence testing was standard procedure in Illinois schools at the time), there would still be another 150 children in this new experimental group, and it would be necessary to conduct interviews to learn which incentives to offer, find parents to finance those incentives, administer and score the tests, etc. If conducting this study once seems unlikely, conducting it twice really strains believability.
- The results are far too consistent compared to real data: “. . . with the sample of students in the present study, sex, class standing, ethnic status, and socioeconomic status were equally affected by the incentive condition” (Breuning & Zella, 1978, p. 225).
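As a rough check, the interview count and incentive budget discussed in the list above can be recomputed in Python. The group sizes are the ones reported for the three tests; the inflation factor is simply the ratio implied by the $25 to $123.53 conversion quoted above, not an official figure, and the variable names are illustrative only.

```python
# Reported group sizes: Otis-Lennon, Lorge-Thorndike, WISC/WISC-R.
experimental = [65, 73, 104]
control = [64, 74, 105]

n_experimental = sum(experimental)
n_control = sum(control)

# At least 4 interviews per experimental student and 3 per control student.
min_interviews = 4 * n_experimental + 3 * n_control

# Incentive budget, assuming an average incentive of $20 in 1976 dollars.
avg_incentive_1976 = 20
budget_1976 = n_experimental * avg_incentive_1976

# Assumption: inflation factor implied by the post's own conversion.
inflation_factor = 123.53 / 25.00
budget_2022 = budget_1976 * inflation_factor

print(n_experimental, n_control)  # 242 243
print(min_interviews)             # 1697
print(budget_1976)                # 4840
print(f"{budget_2022:,.0f}")      # about 23,915
```

The 2022 figure lands within a dollar of the amount quoted above; the small gap comes from rounding in the inflation factor.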
Additionally, there are aspects of the study that are suboptimal. These would not normally indicate fraud, but they seem extra suspicious, given the possibility of fraud in this study:
- The number of tests rescored for the reliability check is inconsistent. The article states, “A random selection of 252 pretests and 252 posttests . . .” were rescored (Breuning & Zella, 1978, p. 223). Later in the paragraph, the number of tests from each group sums to 252, not 504. If I thought that this study really had occurred, then I would chalk this up to a reporting error.
- Another inconsistency that could be a minor reporting error is that p. 223 states that there were 242 children in the intervention group, but the group sizes (from p. 221) sum to 243. In a normal article, I would just assume that this is a typo.
- The article states (Breuning & Zella, 1978, pp. 220, 221) multiple times that examinees were randomly assigned to experimental or control groups for each test. In other words, students taking one test were randomized, and students taking the other tests were subjected to separate randomizations, for a total of three randomizations. For the Otis-Lennon test, 64 students were in the control group and 65 were in the experimental group. For the Lorge-Thorndike test, the group sizes were 74 (control) and 73 (experimental). Finally, there were 105 students in the control group and 104 students in the experimental group who took the Wechsler Intelligence Scale for Children (WISC). The fact that all three tests had such similar numbers of students in each group from three separate randomizations is extremely unlikely, with a probability of .002. In other words, the chance that three samples of these sizes would each be divided this evenly into two groups by pure random assignment is about 1 in 500. It is far more likely that the group size differences would be larger for at least one pair of groups than what was reported.
- Another miraculous randomization happened in the reliability check. Tests were randomly selected (Breuning & Zella, 1978, p. 223), and within each test, equal numbers of tests were selected from each group to be rechecked: 38 tests from the experimental group and 38 tests from the control group for both the Otis-Lennon and Lorge-Thorndike tests, and 50 WISC or WISC-R tests from each of the experimental and control groups. The probability of three pairs of groups having the same size is .000662 (i.e., about 1 in 1,500). If I were giving the authors the benefit of the doubt, I would state that these matching group sizes during the randomization process may be due to a misuse of the terms “randomly assigned” and “random selection.”
- It is really odd that one of the most common incentives that high schoolers wanted was “aquarium set up” (Breuning & Zella, 1978, p. 223). Are there a lot of teenagers who want an aquarium? Were aquariums really popular in the late 1970s? I don’t know, but it just seems like a strange detail.
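The two randomization probabilities in the list above can be approximated with a simple binomial model. The sketch below assumes that each student (or each rechecked test) lands in one of two groups by an independent fair coin flip, and it asks how often the two groups would differ in size by at most one. The post does not spell out how its probabilities were calculated, so this model is an assumption, although it lands on the same values. The function name is hypothetical.

```python
from math import comb

def prob_near_even_split(n: int) -> float:
    """Probability that n independent fair coin flips split into two groups
    whose sizes differ by at most one (an exact tie when n is even)."""
    central = comb(n, n // 2) / 2**n
    # For odd n there are two equally likely near-even outcomes.
    return central if n % 2 == 0 else 2 * central

# Random assignment: 129, 147, and 209 students split 64/65, 74/73, 105/104.
p_assignment = 1.0
for n in (129, 147, 209):
    p_assignment *= prob_near_even_split(n)

# Reliability check: 76, 76, and 100 rechecked tests split 38/38, 38/38, 50/50.
p_recheck = 1.0
for n in (76, 76, 100):
    p_recheck *= prob_near_even_split(n)

print(round(p_assignment, 3))  # 0.002
print(round(p_recheck, 6))     # 0.000662
```

If group membership had instead been fixed in advance (for example, by deliberately drawing equal numbers of tests from each group), these probabilities would not apply, which is the benefit-of-the-doubt reading mentioned above.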
This was the article that drew my attention to the Breuning scandal and has had the most impact of the four articles I investigate in this blog post. According to Google Scholar, it has been cited 29 times. Where its influence has been greatest, though, is in a 2011 meta-analysis published by Duckworth et al. In that meta-analysis, the researchers found that giving examinees incentives to perform well increased scores on intelligence tests by an average of 9.6 IQ points (d = .64). The Breuning and Zella (1978) data were three of the four largest samples in the meta-analysis and totaled 24.2% of all sample members in the meta-analysis. Moreover, the Breuning and Zella (1978) samples were three of the four largest effect sizes in the Duckworth et al. (2011) meta-analysis.
If Breuning’s data in the Breuning and Zella (1978) article are fraudulent, then the consequences for the Duckworth et al. (2011) meta-analysis are monumental. The large effect sizes, combined with the large sample sizes in Breuning and Zella’s (1978) article, mean that these samples have a disproportionately strong influence on the final result in Duckworth et al.’s (2011) meta-analysis. I estimate that if Breuning and Zella’s (1978) samples had been omitted, Duckworth et al.’s (2011) average effect size of providing incentives to examinees would be d = .13, instead of d = .64 (i.e., 1.95 IQ points, instead of 9.60 IQ points). This would mean that the average effect size (the most important finding in a meta-analysis) is almost 5 times higher than it should be, a difference large enough to warrant a major correction to an article.
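The conversion between standardized effect sizes and IQ points used here is simple enough to check directly. The sketch below assumes the conventional IQ standard deviation of 15 points, which is consistent with the paired figures quoted above; the function name is illustrative only.

```python
IQ_SD = 15  # assumption: conventional standard deviation of the IQ scale

def d_to_iq_points(d: float) -> float:
    """Convert a standardized mean difference (Cohen's d) to IQ points."""
    return d * IQ_SD

d_with_breuning = 0.64     # average effect reported in the meta-analysis
d_without_breuning = 0.13  # estimate with the Breuning and Zella samples omitted

print(round(d_to_iq_points(d_with_breuning), 2))        # 9.6
print(round(d_to_iq_points(d_without_breuning), 2))     # 1.95
print(round(d_with_breuning / d_without_breuning, 1))   # 4.9
```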
Additionally, Duckworth et al. (2011) found no evidence of publication bias in three out of four tests they performed. However, Lasker (2020) found that without Breuning and Zella’s (1978) data, the evidence for publication bias becomes overwhelming. When I calculated the unweighted correlation between the effect size and the inverse standard error of each sample, the correlation with the Breuning and Zella (1978) data was r = -.09 (indicating no publication bias); without the suspicious data, the correlation is r = -.40 (indicating a strong likelihood of publication bias). This means that even the effect size of d = .13 (or 1.95 IQ points) likely overstates the causal impact of increasing motivation in intelligence test examinees.
Additional evidence against the Breuning and Zella (1978) article
In my investigation of this article, I contacted the second author of the study. Retired after a productive career in hospital administration, Zella was helpful in telling me his recollections about this article. Although he doesn’t remember everything, some things are clear:
- Zella had little involvement with the study. When I asked if he planned or ran the study, Zella’s answer was an emphatic no. Zella also was not involved with administering intelligence tests or conducting interviews. He told me that he “did something with stats” and “helped with writing up the project,” which Breuning then revised and “updated.” Zella told me that he agreed to help Breuning at the time because Zella believed that having a publication on his resumé would help with the job hunt, and Breuning seemed busy with a lot of research. This is consistent with Breuning’s modus operandi reported by the NIMH investigatory panel, which stated that “. . . Dr. Breuning induced others, who sometimes had little or no involvement into coauthorship” (p. 36). Breuning also listed people as coauthors without their knowledge (p. 47).
- Another fact that Zella told me that is consistent with Breuning’s fraudulent behavior was that Zella has no recollection of seeing the raw data from the study. This was a common characteristic of Breuning’s work. The NIMH panel stated several times in its report that Breuning routinely did not show raw data to his coauthors (pp. 25, 28, 29, 33, 36, 47, 177, 180, 193, 194, 204, 213, 230, 231). Breuning’s explanation (pp. 250-251) was that his coauthors did not ask, which is probably true. But that means little. When Robert Sprague did ask Breuning for raw data from a different study, Breuning was unable to produce it, which got the investigation into Breuning’s research started (Sprague, 1993). Indeed, Breuning was never able to produce raw data for any of the studies that the NIMH panel investigated.
- Others have commented that Breuning’s coauthors are among his victims (Wysocki & Fuqua, 1990). I agree that Zella is certainly a victim of Breuning’s fraud. The scientific enterprise relies on trust. Readers must trust authors. Participants must trust experimenters. And coauthors must trust each other. My research and my discussion with Zella have led me to conclude that Breuning violated that trust and that Zella was one of many innocent victims in this affair.
- When I asked Zella if he would support a request for a retraction, he stated that he would. In Zella’s view, the circumstantial evidence that I have compiled creates strong doubts that the study occurred. If Breuning’s coauthor cannot vouch for the veracity of the study, I think that no journal editor or reader should stand by it, either.
Breuning & Regan (1978)
In this study, Breuning and Regan (1978) supposedly examined whether a new teaching procedure, coupled with motivational incentives, would help high school students in special education classes perform at a typical level of academic accomplishment.
The sample consisted of 125 special education students in 4 biology classes and 4 English classes in the Chicago area. Four classes (two biology and two English) were assigned to each of two intervention groups. In a baseline phase, both groups experienced the teachers’ typical lesson plans. All teachers were given training to create performance objectives and write study guide questions. Class sessions were 2 hours long and were divided into four 30-minute blocks. In order, the students spent each half hour (1) working on the day’s assignment, (2) engaging in class discussions and reviewing assignments, (3) working on the next day’s assignment in small groups or individually, and (4) taking a ~20-question quiz based on the day’s study guide.
Both groups began with a baseline phase in which they received a typical teaching strategy. In Phase 1, one randomly selected group received the experimental intervention first while the other group remained in a non-incentive control group. In Phase 2, the groups switched so that the original control group received incentives and the original experimental group reverted to a control condition of teaching without motivational incentives. In Phase 3, both groups experienced the intervention to increase motivation via incentives. At the end of all four phases, the experimenters administered a retention test to measure how much material the students had learned.
The incentive was that students could earn up to 60 minutes of optional activity time (from the last half of the class session) by earning high scores on the day’s quiz. During this time, the classroom was divided in half, and students with high enough quiz scores enjoyed free time in the “lounge/game room type area” (Breuning & Regan, 1978, p. 183), while the other students worked in the other half of the classroom.
Like the Breuning and Zella (1978) article, there are several aspects of the Breuning and Regan (1978) article that make me doubt that the study occurred:
- For the same reasons that the Breuning (1978a) and Breuning and Zella (1978) studies likely never occurred, the Breuning and Regan (1978) study probably never happened. As stated above, the NIMH panel found (p. 24) no corroborating evidence of any human subjects research that Breuning conducted in Chicago area schools. His work in the K-12 schools did not include research, and he was supposed to be concentrating his time on his dissertation research (on goldfish) during the relevant time.
- Even if a reader wants to give Breuning the benefit of the doubt and believe that he could have performed research that his professors were unaware of, the description of the school and classroom is coincidental. It is amazing that the school was “randomly selected” (Breuning & Regan, 1978, p. 182) and that it happened to have a way to easily partition its classrooms into two sections — just as the study required. A formal partition would definitely be needed to reduce distractions for the studying students, especially because playing music on a phonograph was a popular amenity in the lounge area, and, “Many students brought in their own albums to play during the optional activity time” (p. 183). Any teacher trying to teach English or biology to a struggling student would want to greatly reduce or eliminate the inevitable noise coming from the lounge area.
- The quiz and reward were out of order. In order to be able to have up to 60 minutes of free time as a reward, students would need to take their quiz in the first hour of a class session. Yet, Breuning and Regan (1978, p. 182) reported that quizzes were taken in the last half-hour of class.
- The teachers worked remarkably fast at implementing the intervention. For each week, teachers would have to write performance objectives for every class period, a total of 200 study guide and quiz questions, and also grade quizzes for every student in every class session. All of this work took just 90 extra minutes of preparation time per week! Even if they only used this 90 minutes for quiz writing, these teachers would need to write a question every 27 seconds!
- As a reliability check, five volunteer teachers were asked to evaluate the consistency of quiz grading. Each volunteer evaluated this consistency for 2,000 quizzes that had approximately 20 questions each. In other words, these volunteer teachers, out of the goodness of their hearts, agreed to examine 40,000 quiz questions for a couple of graduate students. Where did these people find the time? Why were they so willing to volunteer? How did Breuning find them?
- As the figure below shows, the results were dramatic and immediate. When a phase changed, quiz scores almost always doubled immediately, even though students in the motivational incentive part of the study were receiving up to 50% less instructional time. Moreover, when a group moved back to the control experience (with no motivational incentives), all improvements in academic performance were lost immediately. This means that students did not use any skills they learned from the motivational incentives when those incentives were removed. This is extremely unusual in this type of study in educational psychology.
- The figure also shows that for the retention test (taken in the 15th class session in Phase 3), there is a clear difference between the topics students had mastered in different phases. However, when comparing performance on two equal phases, performance is equal. This means that students remembered earlier material and later material about equally well and that there was no identifiable learning decay over time. This is extremely unusual; research going back to Ebbinghaus in the 19th century shows that people remember more information learned recently than material learned earlier. But that wasn’t the case in the Breuning and Regan (1978) study.
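The workload figures in the list above can be reproduced with simple arithmetic. The short Python sketch below only restates numbers already quoted from the article (90 minutes of extra preparation, 200 questions per week, 2,000 quizzes of about 20 questions per volunteer); the function and constant names are illustrative and do not come from the original study.

```python
# Figures taken from the description of Breuning and Regan (1978) above.
EXTRA_PREP_MINUTES_PER_WEEK = 90
QUESTIONS_PER_WEEK = 200        # study guide and quiz questions combined
QUIZZES_PER_VOLUNTEER = 2_000
QUESTIONS_PER_QUIZ = 20         # "approximately 20" in the article


def seconds_per_question(prep_minutes: int, questions: int) -> float:
    """Average time available to write one question, in seconds."""
    return prep_minutes * 60 / questions


def questions_checked(quizzes: int, questions_per_quiz: int) -> int:
    """Number of quiz questions one reliability checker had to examine."""
    return quizzes * questions_per_quiz


if __name__ == "__main__":
    print(seconds_per_question(EXTRA_PREP_MINUTES_PER_WEEK, QUESTIONS_PER_WEEK))
    print(questions_checked(QUIZZES_PER_VOLUNTEER, QUESTIONS_PER_QUIZ))
```

The first value is the time budget per question if every extra minute went to question writing, which is the most generous reading of the article's claim.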
There are other aspects of the study that seem fishy, given the oddities listed above. These characteristics would not normally indicate fraud, but they certainly undercut the credibility of the study:
- The total sample size is reported as 125, but the class sizes in Table 1 (Breuning & Regan, 1978, p. 181) sum to 124. When the racial/ethnic composition of the sample is reported, the number of males sums to 65, instead of the 74 that is reported earlier in the article (p. 181). If I were inclined to give the benefit of the doubt, then I would chalk this up to sloppiness.
- The authors claimed that, “A more structured and closely monitored parent involvement program is currently underway and preliminary data indicate a further enhancement of student classroom academic performance” (p. 186). This study was apparently never published. At the time of publication, the authors were working full-time about 290 miles away from Chicago, and it’s not clear how they would have performed such a study from another state while working in an inpatient facility for children with severe disabilities (and not the K-12 schools).
- This study’s theory is based heavily on the results of the Breuning and Zella (1978) study that I believe never occurred. Breuning and Regan’s (1978) introduction states that the prior study “showed” that IQ was highly susceptible to motivational incentives. The latter study was designed to demonstrate that academic achievement was, too. If the earlier study is fraudulent (as I believe), then it increases the likelihood that this one is, too. It also undercuts the theoretical rationale for the Breuning and Regan (1978) study, even if it did occur.
- Finally, there is the real possibility that Regan himself may have denied the existence of this study. The NIMH investigatory panel into Breuning’s misconduct stated that:
“Dr. Regan said he had conducted experiments with goldfish with Dr. Breuning at Oakdale, but that he knew nothing of any research with human subjects done by Dr. Breuning.” (NIMH investigatory panel, p. 43)
The wording is ambiguous; it is not clear whether Regan was unaware of any human subjects research Breuning conducted at Oakdale (in Michigan, which would have occurred after the article was published) or in general. Still, it leaves open the possibility that Breuning never conducted any human subjects research that Regan was aware of, which would mean this article reports a study that never occurred. Even though Regan is a coauthor, this is plausible because (as stated above) the NIMH investigatory panel found that “. . . Dr. Breuning induced others, who sometimes had little or no involvement into coauthorship” (p. 36). Breuning also listed people as coauthors without their knowledge (p. 47).
Breuning & Davis (1981)
The article by Breuning and Davis (1981) reports a study of the impact of reinforcement for responses on intelligence test performance. Forty individuals with an intellectual disability took intelligence tests under standard procedures and, as expected, had low IQ scores (M = 39.4, SD = 11.6). The group was then divided into four subgroups of 10 individuals each. Two of these took the test again under standard procedures (M = 41.1, SD = 10.4 and M = 39.9, SD = 10.3). One group, though, received a small reinforcement immediately after each correct response, which caused their scores to increase (M = 59.9, SD = 10.2). The final group was reinforced for incorrect responses, which lowered their scores (M = 29.8, SD = 11.7). Examples of reinforcers were a “. . . drink of pop, piece of a cracker, jelly-bean . . .” (Breuning & Davis, 1981, p. 309). These rewards were consumed immediately after being given to the examinee (Breuning & Davis, 1981, p. 310).
The four groups were then further subdivided for a third test administration, so that half of each group took the test again under standard conditions and the other half were reinforced for either correct or incorrect responses.
| Order of Test Conditions | Test 1 Mean (SD) | Test 2 Mean (SD) | Test 3 Mean (SD) |
|---|---|---|---|
| Standard-Standard-Standard | 42.0 (8.0) | 41.6 (6.3) | 42.0 (6.3) |
| Standard-Standard-Correct Reinforcement | 38.8 (14.5) | 40.6 (14.2) | 57.6 (9.8) |
| Standard-Correct Reinforcement-Standard | 36.2 (13.9) | 56.0 (10.0) | 60.2 (8.9) |
| Standard-Correct Reinforcement-Correct Reinforcement | 41.0 (11.7) | 63.8 (9.7) | 72.0 (9.0) |
| Standard-Standard-Standard | 39.6 (15.6) | 40.4 (14.2) | 40.4 (14.6) |
| Standard-Standard-Incorrect Reinforcement | 40.0 (5.2) | 39.4 (6.0) | 31.8 (5.0) |
| Standard-Incorrect Reinforcement-Standard | 37.0 (17.5) | 27.6 (14.0) | 26.6 (13.6) |
| Standard-Incorrect Reinforcement-Incorrect Reinforcement | 40.6 (9.7) | 32.0 (9.9) | 23.0 (6.8) |

Note: All groups n = 5
As is apparent, when examinees were given reinforcement for each correct answer, the IQ increased. When they were reinforced for giving incorrect answers, the IQ scores decreased. Standard administration (i.e., without reinforcement for any responses) led IQ scores to be roughly the same as the previous test administration.
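As a cross-check on how the table relates to the group means quoted earlier, the sketch below averages the subgroup means from the table and compares them with the figures reported for the full sample and for the four groups of 10. It assumes that each group of 10 at the second test is the pair of subgroups sharing the same first two conditions; the "(a)" and "(b)" labels and all function names are added here for illustration and are not in the article. Because every subgroup has n = 5, a simple average of subgroup means equals the pooled mean.

```python
from statistics import mean

# Subgroup means (Test 1, Test 2, Test 3) copied from the table above; n = 5 each.
# "(a)" and "(b)" are labels added here to tell the two all-standard
# control subgroups apart; they do not appear in the article.
SUBGROUP_MEANS = {
    "S-S-S (a)": (42.0, 41.6, 42.0),
    "S-S-Correct": (38.8, 40.6, 57.6),
    "S-Correct-S": (36.2, 56.0, 60.2),
    "S-Correct-Correct": (41.0, 63.8, 72.0),
    "S-S-S (b)": (39.6, 40.4, 40.4),
    "S-S-Incorrect": (40.0, 39.4, 31.8),
    "S-Incorrect-S": (37.0, 27.6, 26.6),
    "S-Incorrect-Incorrect": (40.6, 32.0, 23.0),
}

# Assumption: each group of 10 at Test 2 is the pair of subgroups that
# share the same first two test conditions.
SECOND_TEST_GROUPS = {
    "standard (correct arm)": ("S-S-S (a)", "S-S-Correct"),
    "correct reinforcement": ("S-Correct-S", "S-Correct-Correct"),
    "standard (incorrect arm)": ("S-S-S (b)", "S-S-Incorrect"),
    "incorrect reinforcement": ("S-Incorrect-S", "S-Incorrect-Incorrect"),
}


def pooled_mean(values) -> float:
    """Mean of equally sized subgroups, rounded to one decimal place."""
    return round(mean(values), 1)


def first_test_overall_mean() -> float:
    """Overall Test 1 mean across all eight subgroups."""
    return pooled_mean(m[0] for m in SUBGROUP_MEANS.values())


def second_test_group_means() -> dict:
    """Test 2 mean for each group of 10, rebuilt from its two subgroups."""
    return {
        name: pooled_mean(SUBGROUP_MEANS[key][1] for key in keys)
        for name, keys in SECOND_TEST_GROUPS.items()
    }


if __name__ == "__main__":
    print(first_test_overall_mean())
    print(second_test_group_means())
```

Under that pairing assumption, the rebuilt means can be compared directly with the values quoted in the text (39.4 overall; 41.1, 59.9, 39.9, and 29.8 for the four groups).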
As with the other three articles reported in this blog post, there are several hints that the Breuning and Davis (1981) article reports a study that never occurred:
- Of the 40 participants, 23 (57.5%) were female and 17 (42.5%) were male (Breuning & Davis, 1981, p. 308). This is unusual because 55-70% of individuals with an intellectual disability are male. Moreover, the gender imbalance gets more severe as the mean IQ for a sample decreases (American Psychiatric Association, 2013). With the overall sample having a mean IQ of 39.4, it is highly unlikely that the sample would be majority female.
- Assignment to the groups was “random” (Breuning & Davis, 1981, p. 309). And yet, pairs of intervention groups had exactly the same number of people taking each of three intelligence tests (p. 309). Again, such perfectly evenly split groups in a random assignment procedure are extremely coincidental.
- The intelligence test administrators “. . . were unaware of the particular hypothesis of the study . . .” (Breuning & Davis, 1981, p. 309). However, this is not possible because the intervention procedure required reinforcing examinees after each correct or each incorrect answer (pp. 309-310), which is definitely not standard procedure for administering intelligence tests. (In standard administration, examinees are not told whether they have answered correctly or not, and any praise is for effort. Experienced intelligence test administrators have a great poker face because they do not want to reveal any hints about performance to the examinee.) Thus, the administrators had to be told how to implement the intervention and to record the resulting IQ scores. The hypothesis would have been obvious. Even if they had not been explicitly told the hypothesis, the test administrators would have been able to figure it out.
- Breuning and Davis (1981) are contradictory about the awareness of other staff members regarding the study. Supposedly, the test scores were added to each participant’s clinical record (Breuning & Davis, 1981, p. 310). But in the same paragraph, it states that, “All program staff and ancillary program personnel were unaware of any aspect of the study” (Breuning & Davis, 1981, p. 310). How is that possible when the scores are added to each patient’s record? Additionally, wouldn’t the staff and psychologists notice that these patients were being tested way more often than usual (three times in 19 weeks)?
- Just as in the other articles I describe in this post, Breuning was able to find a lot of volunteers for this study. Ten volunteers administered the intelligence tests, while one volunteer psychologist observed random test administration sessions through a two-way mirror (Breuning & Davis, 1981, p. 310). Three other volunteer psychologists conducted reliability checks of the data (p. 310). None of these volunteers are thanked in the acknowledgements, and it is unclear how Breuning found so many volunteers, especially for the portion of the research done in Coldwater, Michigan, which had about 9,000 residents at the time.
- As with the other articles examined for this post, the results are extremely unusual compared to other research on the topic. According to Breuning and Davis, “Under the reinforcement condition on the second test, the number of responses reinforced steadily increased as the test progressed. This resulted in scores being largely determined by increased performances on the latter sections of the test” (1981, p. 314). This is a VERY odd result because it would indicate that the test items’ susceptibility to reinforcement is a product of their position in the test, and not their content, difficulty, or g-loadedness (which measures how well a task measures general intelligence). Later, when not reinforced for their responses, the examinees got these items wrong despite answering correctly earlier (p. 315). This is a VERY odd practice effect, and I have not been able to find a similar practice effect anywhere in the research on intelligence testing. [Update: My colleague, Dr. Joni Lakin of the University of Alabama, pointed out that intelligence test items almost always increase in difficulty, so it is extremely hard to believe that examinees with such low scores can get increasingly difficult items correct just by eating a jelly bean.]
- Breuning and Davis (1981) interpreted their results in terms of behavioristic theory: “Thus, even though there was virtually no score change when the standard condition followed the reinforcement condition, the intratest patterns of responding were reversed. If presented graphically, the two intratest response patterns would be similar to stereotypical acquisition and extinction curves, respectively” (p. 315). In other words, Breuning’s results are what one would expect from an operant conditioning study where a reinforced response becomes more frequent and a previously reinforced response becomes less frequent as reinforcement ceases. However, operant conditioning applies to studies where the stimulus and the response do not change across trials (or are very similar across trials). That’s not what happens in an intelligence test, where the items (i.e., stimuli) differ and the desired responses differ, especially when the item format changes from one subtest to another. Score changes after reinforcement of responses should not resemble an acquisition or extinction curve at all.
- The results are too “clean.” Even though the sample size was only 40, and the smallest subgroups were only n = 5 each, the results were consistent, with very little noise in the data. Moreover, when comparisons were made for each test, each test subsection, and when data were “broken down by age, sex, and institution” (Breuning & Davis, 1981, p. 318), the results were the same. This is extremely unusual, especially for small sample sizes.
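The point about the sex ratio in the list above can be given a rough number with a binomial tail probability. The sketch below is only illustrative: it assumes the 40 participants were an independent random draw from a population with a fixed male share, which institutional samples need not satisfy, and it uses the 55% and 70% bounds quoted above as the assumed shares. The function name is hypothetical.

```python
from math import comb


def prob_at_most_k_males(n: int, k: int, p_male: float) -> float:
    """Binomial probability of observing k or fewer males among n people,
    assuming each person is independently male with probability p_male."""
    return sum(
        comb(n, i) * p_male**i * (1 - p_male) ** (n - i)
        for i in range(k + 1)
    )


if __name__ == "__main__":
    # 17 of the 40 participants were reported as male.
    for share in (0.55, 0.70):
        print(share, prob_at_most_k_males(40, 17, share))
```

The result is sensitive to the assumed male share, so the lower and upper bounds are printed side by side rather than collapsed into a single figure.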
Again, there are other characteristics which add to my suspicions regarding this article but are less conclusive evidence about the possibility of fraud:
- The study supposedly occurred at the Coldwater Regional Center for Developmental Disabilities in Coldwater, Michigan, and at the University of Pittsburgh. These locations are where the majority of Breuning’s known fraudulent research supposedly occurred. However, the fact that other studies he claimed happened at these locations were fraudulent does not prove that this study is, too.
- There were 57 people in the sample, but consent could only be obtained for 40 (Breuning & Davis, 1981, p. 308). Coincidentally, this balanced design required a sample size that was exactly a multiple of 8, and the number of sample members with consent matched that desired sample size exactly. Possible? Sure, but there was only a 1 in 8 chance of that happening.
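The "1 in 8" figure in the last item can be checked by counting. The sketch below assumes, purely for illustration, that every possible number of consenting participants from 1 to 57 was equally likely, which is a simplification rather than anything stated in the article; the function name is hypothetical.

```python
from fractions import Fraction


def share_of_multiples(total: int, divisor: int) -> Fraction:
    """Share of consent counts 1..total that are exact multiples of divisor,
    assuming every count is equally likely."""
    hits = sum(
        1 for consented in range(1, total + 1) if consented % divisor == 0
    )
    return Fraction(hits, total)


if __name__ == "__main__":
    share = share_of_multiples(57, 8)
    print(share, float(share))
```

Under that equal-likelihood assumption the share comes out close to one in eight, which is the figure the argument relies on.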
If the four articles are indeed fraudulent (as I believe all of them are), then Breuning’s scientific fraud began when he was in graduate school and had already been underway for years when he started fabricating studies funded by NIMH. Taken together, these articles show a pattern of irregularities in Breuning’s research in educational psychology and intelligence. Methodological irregularities across multiple articles are as follows:
- All four studies would have required massive amounts of manpower and time — far more than was at Breuning’s disposal as a graduate student (for the studies published in 1978) or at Coldwater (for the Breuning & Davis, 1981, study). Breuning claims in all these articles that a small army of volunteers helped with this work (including reliability checks in three of the studies: Breuning & Davis, 1981; Breuning & Regan, 1978; Breuning & Zella, 1978). None of these volunteers are acknowledged by name in the articles, and it is not clear how Breuning would have found so many volunteers with expertise in education, school psychology, test administration, and other areas.
- The articles all report very strange practice effects and patterns of recall that go against what normally appears in human learning (Breuning, 1978a; Breuning & Davis, 1981; Breuning & Regan, 1978; Breuning & Zella, 1978). Indeed, one of the articles had results that were so out of the ordinary that it spawned a comment from another scientist (Brophy, 1978).
- All four studies report huge effects from relatively simple interventions. These large effect sizes violate Warne’s First Law of Behavioral Interventions, which states, “Brief, subtle, or weak interventions will produce brief, subtle, or weak changes in human behavior” (Warne, 2020, p. 163). Across these articles, Breuning claims he can raise IQ by dozens of points (Breuning & Davis, 1981; Breuning & Zella, 1978), almost completely eliminate the need for special education services (Breuning & Regan, 1978; Breuning & Zella, 1978), and greatly reduce underachievement (Breuning, 1978a; Breuning & Regan, 1978). If something seems too good to be true, it probably is.
- Three of the four articles (Breuning, 1978a; Breuning & Regan, 1978; Breuning & Zella, 1978) reported studies that would need to occur while Breuning lived in Chicago (between September 1974 and December 1977). During this time, he would have needed to coordinate massive numbers of volunteers, design and execute complex studies, and analyze data, all while also taking graduate courses, conducting completely unrelated research on goldfish in his graduate program, and writing and defending a dissertation. And for a year of that time, he also worked as a student teacher and performed “contractual work in the school system” (NIMH investigatory panel, p. 238). It is simply unbelievable that a graduate student could do all of this with no previous training or research in educational psychology or intelligence testing, no connections to schools, and no prior published research on human participants. Indeed, Breuning’s reports of extreme research productivity were suspicious enough to Robert Sprague to request the investigation that led to Breuning’s downfall (Sprague, 1993).
- For three of the studies (Breuning, 1978a; Breuning & Davis, 1981; Breuning & Zella, 1978), the results were far more consistent than real data would likely be. Additionally, when Breuning’s theory required an interaction between the intervention and a subject characteristic, it always showed up (Breuning, 1978a; Breuning & Zella, 1978), even though interactions tend to replicate at much lower rates than main effects.
- Two of the articles (Breuning, 1978a; Breuning & Zella, 1978) claim very briefly that replications were performed that confirmed the main study’s findings. However, these replications are not reported elsewhere, and no data about them are available. One coauthor of two supposed replications claimed to know nothing about Breuning’s research on humans (NIMH investigatory panel, pp. 238-239).
- In two studies (Breuning, 1978a; Breuning & Davis, 1981), the demographics of the sample are very unusual, with far fewer males than would be expected from a low-IQ sample.
- In two of the studies (Breuning, 1978a; Breuning & Zella, 1978), adults in the students’ lives — parents, employers, teachers — were remarkably cooperative and willing to invest significant time and/or money into the success of the study.
- For the Breuning and Zella (1978) and Breuning and Davis (1981) studies, students were randomly assigned to conditions, but the sample sizes were too balanced and equal for randomization to be likely.
Apart from the methodology, these articles have another commonality. The message of these articles is that the right motivational incentives and reinforcement program can eliminate low IQ, special education placement, and low academic achievement in the vast majority of people who experience them. Indeed, Breuning’s philosophy about these intransigent issues can be summed up with a quote from the era:
“. . . academic performance is largely due to an interaction between teaching procedure and incentive motivation.” (Breuning, 1978b, p. 147)
Breuning seems to have had a great deal of faith in the power of behaviorism to solve the problems of low IQ and low academic achievement. All the articles report that the vast majority of students in special education or individuals with low IQ can function much more normally if they just receive the correct incentives and reinforcement schedule. The only evidence to support this idea is from Breuning’s articles.
Just as I did with the unretracted articles that the NIMH panel found were fraudulent, I am going to ask the editors of the journals that published these four articles to retract them. This process can be slow, though, and some editors are hesitant to retract, even when the evidence is clear that an article is fatally flawed. I hope for the best, though.
There are six more articles published between 1977 and 1985 with Stephen E. Breuning as first author that neither I nor the NIMH panel investigated. NIMH did not investigate them because they were not funded by the agency. I have not investigated them because they are outside of my area of expertise (i.e., goldfish studies, psychopharmacological studies). I hope that someone else will look at other parts of Breuning’s research corpus and help determine whether the scientific record needs further cleaning up.
References
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). American Psychiatric Publishing.
Breuning, S. E. (1978a). Precision teaching in the high school classroom: A necessary step towards maximizing teacher effectiveness and student performance. American Educational Research Journal, 15(1), 125-140. https://doi.org/10.3102/00028312015001125
Breuning, S. E. (1978b). Notes and comments: Reply. American Educational Research Journal, 15(1), 145-147. https://doi.org/10.3102/00028312015001145
Breuning, S. E., & Davis, V. J. (1981). Reinforcement effects on the intelligence test performance of institutionalized retarded adults: Behavioral analysis, directional control, and implications for habilitation. Applied Research in Mental Retardation, 2(4), 307-321. https://doi.org/10.1016/0270-3092(81)90026-6
Breuning, S. E., & Regan, J. T. (1978). Teaching regular class material to special education students. Exceptional Children, 45(3), 180-187. https://doi.org/10.1177/001440297804500304
Breuning, S. E., & Zella, W. F. (1978). Effects of individualized incentives on norm-referenced IQ test performance of high school students in special education classes. Journal of School Psychology, 16(3), 220-226. https://doi.org/10.1016/0022-4405(78)90004-3
Brophy, J. E. (1978). Notes and comments: Precision teaching in the high school classroom: A commentary. American Educational Research Journal, 15(1), 141-143. https://doi.org/10.3102/00028312015001141
Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30(2), 116-127. https://doi.org/10.1037/h0076829
Duckworth, A. L., Quinn, P. D., Lynam, D. R., Loeber, R., & Stouthamer-Loeber, M. (2011). Role of test motivation in intelligence testing. Proceedings of the National Academy of Sciences, 108(19), 7716-7720. https://doi.org/10.1073/pnas.1018601108
Lasker, J. (2020, June 21). Motivation and “IQ”. RPubs. https://rpubs.com/JLLJ/duckworth
Sprague, R. L. (1993). Whistleblowing: A very unpleasant avocation. Ethics & Behavior, 3(1), 103-133. https://doi.org/10.1207/s15327019eb0301_4
Warne, R. T. (2020). In the know: Debunking 35 myths about human intelligence. Cambridge University Press. https://doi.org/10.1017/9781108593298
Wysocki, T., & Fuqua, R. W. (1990). The consequences of a fraudulent scientist on his innocent coinvestigators. JAMA, 264(24), 3145-3146. https://doi.org/10.1001/jama.264.24.3145