Has Stephen Jay Gould’s The Mismeasure of Man really been discredited?

Simon Whitten
44 min read · Jul 9, 2019


A century and a half ago the most learned and conscientious of men believed that the size of the human skull was a measure of a person’s intelligence, and that humanity comprised multiple sub-species which could be objectively ranked into a hierarchy of cranial, and hence intellectual, capacity. Half that time ago, a mere 75 years, this belief, though not unchallenged, remained widespread within the scientific community, albeit with crude and ineffective psychometric testing having by then replaced the physical measuring of skulls.

But why did scientists once believe such silly ideas? Did a fair and objective interpretation of the evidence available at the time support such conclusions, only for new evidence to come along and paint a radically different picture?

Well, no. At least not according to the late evolutionary biologist and science communicator Stephen Jay Gould, who argued in his 1981 book The Mismeasure of Man that the evidence simply had never been on the side of the skull-measurers, literal or figurative. It isn’t that these early scientists acted dishonestly while modern scientists possess greater integrity; indeed, many of these early scientists practised a degree of transparency in reporting their data that sadly surpasses the present-day norm. Instead, argued Gould, such scientists’ judgement had been clouded by their socially-conditioned expectation that the hierarchies of race and class, of which they found themselves near the top, were the product of natural rather than social forces. As a result, even the most meticulous and honest of these scientists collected and interpreted their data in subtly biased ways.

The finding of a natural racial hierarchy, extensively replicated throughout the 19th and early 20th centuries, emerged not from some objective property of the data but from the minds of the scientists and the cultures in which they lived.

This, for Gould, is no mere curiosity of history. Instead Mismeasure is written as a cautionary tale, not because a few cranks continue to peddle these silly ideas (although sadly they do), but as a warning against overconfidence in the cold, disinterested objectivity of modern-day science. Scientific conclusions were, and still are, shaped not only by evidence but also by the attitudes, assumptions and expectations of individual scientists, the scientific community and wider society.

In order to have some hope of overcoming this bias, scientists must first be aware of what their biases are: be upfront about them, and don’t pretend to have no stake in the outcome. What scientist has ever been truly indifferent about the results of an experiment?

But not everyone agrees with Gould. His critics argue that the scientific method is robust against the influence of personal prejudices, and that Gould is mistaken in the many examples he cites that suggest otherwise. And they claim to have new data that proves it.[1]

One recent study by Russell Warne and his colleagues, written up in Quillette by Warne himself, concludes that Gould was incorrect in his finding of bias in the work of early twentieth century psychologists who used data from their World War One Army mental tests as evidence of racial differences in innate intelligence. Warne alleges in his Quillette article that Gould was committed to ideology over data, that he was mounting a “postmodernist” attack on scientific objectivity and accuses him of “lying for social justice” by “distorting evidence” and presenting a “deceptive analysis,” all the while comparing him to a religious fanatic.

These allegations follow similar (though slightly more sober) accusations by a team of physical anthropologists in 2011 led by Jason Lewis, who sought to counter Gould’s assertion that “unconscious manipulation of data may be a scientific norm” because “scientists are human beings rooted in cultural contexts not automatons directed toward external truth” by accusing Gould of making false claims in a chapter of his book relating to 19th century craniometry.

While Warne’s recent attempt received some buzz among the usual suspects, Lewis’s original allegations received substantial press attention — such that although they would soon be thoroughly discredited[2], the damage to Gould’s reputation had already been done. Any Google search for Gould’s name in connection with this book will yield dozens of results claiming that Gould was a fraud.

In this essay I will argue that Gould’s hypothesis is correct, both in the general sense that scientists’ socially conditioned expectations often influence their collection and interpretation of data, and more specifically that the two main examples in contention are clear-cut cases of racial bias.

I will further argue that not only are the critics wrong in their assessment of Gould, both critiques are so lacking in substance that they amount to nothing more than attempts at character assassination thinly veiled as science.

And along the way we’ll have fun exploring some of the more farcical episodes in the history of scientific racism.

There’s More Than One Way to Measure A Skull

One of the best known sections of The Mismeasure of Man concerns the skull-measuring antics of Samuel Morton, a 19th century physician and father of American anthropology.[3]

Throughout his life, one of his main research interests was developing an empirical justification for Louis Agassiz’s theory that the races of man were created by God separately, rather than sharing common ancestors.

Morton collected skulls from across North and South America that had belonged to members of various ethnic groups. He also received shipments of mummified remains from ancient Egyptian tombs. Morton had each skull measured by filling its interior with a granular material. At first seed was used, but Morton switched to using lead shot when the seed-based measurements proved unreliable (prone to giving inconsistent results with remeasurement). Little refinements to his own methodology like this one, felt Gould, speak to Morton’s credit as an empiricist who wanted to make his case honestly.

On the basis of these measurements he calculated mean values of skull size for each race, with Caucasians on top, followed by Mongolians (by which he meant all East Asians) and then by Americans and Ethiopians (by which he meant sub-Saharan Africans and their diaspora). These calculations were used to support the notion of inherent differences in racial intelligence.

Noting that the racial ranking of Caucasian subgroups and Africans in modern America was consistent with the ranking found in the ancient Egyptian tombs, he interpreted this as evidence in support of Agassiz’s theory that the races had always been separate.

In a 1978 paper, upon which the relevant section of The Mismeasure of Man is based, Gould re-analysed Morton’s lead-shot data which he accepted was likely “objective, accurate and repeatable.” There was nothing wrong with the final measurements of the individual skulls, rather various sources of bias had crept into Morton’s analysis of these measurements.

See, each racial group could be divided up into various subgroups. The Caucasian group, for example, was divided into English, Germans, Celts, Semites and so on. The average skull sizes for these sub-groups in Morton’s data vary considerably and overlap with those of the other racial groups. Under such conditions the relative weighting of each sub-group becomes hugely important.

Forget skulls for a moment. Which are bigger, dogs or cats? Most breeds of dog are larger than most domestic cats, but tigers are bigger still. If our sample of cats were heavy on tigers but light on domestic cats, while our sample of dogs were heavy on chihuahuas but light on St Bernards, then we would calculate that cats are, on average, bigger than dogs. Conversely, if our sample of cats were dominated by domestic cats and had only a couple of tigers, but our sample of dogs had wolves, dingoes, St Bernards, labradors and beagles all represented in similar proportions, we would find that dogs are, on average, bigger than cats.

The average size of a dog or a cat simply isn’t a well-defined thing. No amount of measurements of any degree of accuracy and precision will change that. And whatever data you have, you can always choose a method of calculation that will give you your desired result (no weighting, equal weighting, or weighting proportional to relative populations at some arbitrary time and place).
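If it helps to see the point in action, here’s a toy sketch in Python. The numbers are made up (they have nothing to do with Morton’s collection); all it demonstrates is that the very same measurements can put either species on top, depending on how the subgroups are weighted.

```python
# Illustrative only: invented body masses (kg) for a few subgroups of each species.
cat_groups = {"domestic": [4, 5, 3, 4] * 3, "tiger": [200]}
dog_groups = {"chihuahua": [2, 3, 2], "beagle": [10, 11], "st_bernard": [70, 75]}

def pooled_mean(groups):
    """Mean over all individuals: implicitly weights subgroups by sample size."""
    values = [v for vs in groups.values() for v in vs]
    return sum(values) / len(values)

def equal_weight_mean(groups):
    """Mean of subgroup means: every subgroup counts equally, however many specimens it has."""
    sub_means = [sum(vs) / len(vs) for vs in groups.values()]
    return sum(sub_means) / len(sub_means)

for label, groups in [("cats", cat_groups), ("dogs", dog_groups)]:
    print(label, round(pooled_mean(groups), 1), round(equal_weight_mean(groups), 1))
# Pooling every individual puts dogs on top; weighting the subgroups equally puts cats on top.
# Same measurements, opposite conclusions: exactly the freedom Morton had with his subgroups.
```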

We encounter a similar problem when attempting to construct racial averages from Morton’s data. His sample for Native Americans was dominated by Peruvian Inca, who jointly had the smallest skull sizes of any of the sub-groups in Morton’s collection, weighing down the Native American average.

But it’s not as though Morton was unaware of how his results could be distorted by unrepresentative subgroups. The smallest subgroup in his collection for Caucasians was Indians (described as “Hindoos”). In contrast with his calculation of the Native American average, Morton took the decision to exclude most Indian skulls from his calculation of the Caucasian average on the grounds that he considered them too small to be representative.

Gould documents many arbitrary decisions like this one which, taken on their own, might almost be justifiable, but which considered together reveal a clear pattern: whenever a subjective judgement call was required, Morton consistently took decisions that favoured a high Caucasian average and lower American and African averages.

We like to think that science is impersonal, objective and empirical. But the reality is that raw data often contains anomalous results, outliers and miscalibrated readings. Not every datapoint ever recorded gets used in the final analysis. In order to avoid results that are dominated by noise and systematic error, scientists need to use their own judgement. And whenever somebody uses their own judgement, there’s room for their expectations to come into play; the level of scrutiny a result receives is proportional to how surprising it is.

In his reanalysis of Morton’s data, based on adopting an equal weighting for each subgroup and excluding small samples that were mostly of a single sex (men have larger heads than women), Gould showed that Morton’s own data produced racial averages with no statistically significant difference between them. The final result of any scientific study depends not just on the raw measurements made, but on the method used to interpret them.

An appropriately topical example, both of the role that human judgement plays in interpreting data and of how seriously the threat of expectation bias is taken in well-conducted science, comes from the Event Horizon Telescope team, who recently published the first ever image of a black hole.

A key challenge of the Very Long Baseline Interferometry technique used is that there are an infinite number of reconstructed images that are compatible with the same telescope data; including an infinite number that look like plausible images of a black hole and an infinite number that would be highly surprising, but which are not entirely physically impossible.

To make things worse, the unusually small number of telescopes in the array posed a problem for the conventional algorithm used to process radio telescope data.

“The dangers of false confidence and collective confirmation bias are magnified for the EHT because the array has fewer sites than typical VLBI arrays, there are no previous VLBI images of any source at 1.3 mm wavelength, and there are no comparable black hole images on event-horizon scales at any wavelength.” (The Event Horizon Telescope Collaboration, 2019)

As Dr Bouman explains in her Caltech lecture, this traditional approach is also highly sensitive to human choices. Part of the solution involved adopting a Bayesian approach to imaging, alongside the traditional one, but Bayesian algorithms are highly sensitive to their initial inputs. Essentially, you need to give the algorithm an image representing what you expect the object to look like, and the algorithm will iterate through interpretations of the data looking for one that best meets these criteria.
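To get a feel for why that matters, here’s a heavily simplified sketch. It isn’t the EHT pipeline, just a generic regularised reconstruction with an invented random measurement operator and a quadratic pull toward the prior image; the point is only that when there are fewer measurements than unknowns, the prior leaks into the answer.

```python
# Toy regularised "imaging": minimise ||A x - data||^2 + lam * ||x - prior||^2
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_measurements = 50, 20                    # under-determined, like sparse VLBI coverage
A = rng.normal(size=(n_measurements, n_pixels))      # stand-in measurement operator
true_image = rng.random(n_pixels)
data = A @ true_image + 0.01 * rng.normal(size=n_measurements)

def reconstruct(prior_image, lam=1.0):
    """Closed-form minimiser of the penalised least-squares objective above."""
    lhs = A.T @ A + lam * np.eye(n_pixels)
    rhs = A.T @ data + lam * prior_image
    return np.linalg.solve(lhs, rhs)

recon_flat_prior = reconstruct(prior_image=np.full(n_pixels, 0.5))
recon_zero_prior = reconstruct(prior_image=np.zeros(n_pixels))

# Same data, two priors, two noticeably different "images": the directions the
# data don't constrain get filled in by whatever you expected to see.
print(np.linalg.norm(recon_flat_prior - recon_zero_prior))
```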

Obviously both these approaches have huge room for expectation bias to prejudice the results.

For that reason, the remainder of the imaging process was basically the world’s most elaborate multi-step exercise in bias elimination. It involved a blinding process, with four separate teams working in parallel, not allowed to share results with each other.

Four different ways of interpreting the same measurements. You can never entirely remove the role of human judgement in science.

As Dr Bouman explains, they also fed their algorithm different simulated measurements, to make sure it wouldn’t spit out an image of a black hole whatever data it was fed, and trained their algorithm on different target images to make sure that the final image wasn’t just an artefact of a particular target image.

I recommend checking out Dr Bouman’s lecture at Caltech to get a feel for the potential threat posed by expectation bias in modern science and the sheer amount of effort that goes into attempting to minimize it.

Returning to skulls, Gould also suggested that his expectation bias hypothesis could explain the large increase in the Black average between Morton’s original results (using seed based measurements) and final results (using lead-shot based measurements). Seeds are easily compressible, they don’t pack evenly and they vary in size. Measurements using seed were less consistent than those using lead shot, and although some individual skulls were measured to be smaller using seed while others were measured to be larger, on average seed produces smaller results for skull volume for each race. But the increase in volume when moving from seed to shot was not equal for each race. The Caucasian skulls were 1.8 cubic inches larger when measured by shot. The Native Americans were 2.2 cubic inches larger. And the Black skulls were a whopping 5.4 cubic inches larger when measured with the more accurate shot rather than with seed.

Gould comments,

“Plausible scenarios are easy to construct. Morton, measuring by seed, picks up a threateningly large black skull, fills it lightly and gives it a few desultory shakes. Next, he takes a distressingly small Caucasian skull, shakes hard, and pushes mightily at the foramen magnum with his thumb. It is easily done, without conscious motivation; expectation is a powerful guide to action.”

So, what do Gould’s critics have to say about this?

Morton’s Defenders

Well, in 2011 Jason Lewis and his colleagues set about remeasuring the skulls in Morton’s collection, and concluded that Morton’s final measurements (those conducted with lead shot) were generally accurate, proving that Gould was wrong.

Yeah.

If you’ve already spotted the problem with this criticism of Gould then give yourselves a pat on the back, you’ve already done a better job than half of the science writers who’ve written about this.

The problem, of course, is that Gould’s entire reanalysis was based on assuming that Morton’s raw data was accurate, at least once he switched to lead-shot. The accuracy of the measurement of individual skulls was never in contention.

Gould’s only conjecture with regard to bias in measurement concerned the earlier seed-based measurements that Morton himself rejected. Nevertheless Lewis claims to have proven Gould wrong by measuring the skulls and showing that the results more-or-less agree with Morton’s shot-based measurements.

No unbiased person who bothered to check Gould’s original paper could possibly have fallen for it.

The authors then go on to respond to something Gould actually did say. Using the Native American skulls, for which we have the most complete data, Lewis showed that the average increase in skull size moving from seed to lead-shot measurements was indeed 2.2 cubic inches, as reported by Gould, but also observed that some individual American skulls decreased in size when measured with the more accurate lead shot. This, he claims, “casts significant doubt on the hypothesis that mismeasurements with seed were a function of Morton’s racial bias.”

Of course all it really suggests is that random error is a bigger influence than bias on an individual skull basis, but the thing about random errors is they cancel out when you have lots of them. That’s why it’s called random error. Once the random errors have cancelled out, the remaining difference is the bias.
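A quick simulation makes the point. The numbers here are invented for illustration (they are not Morton’s), but they show how a modest systematic under-filling of one group’s skulls survives in the group averages even though plenty of individual comparisons go the other way.

```python
# Random error pushes individual measurements both ways; bias pushes only one way.
import random
random.seed(42)

true_volume = 85.0            # hypothetical true mean skull volume, cubic inches
noise = 4.0                   # random measurement error, either direction
bias_against_group_b = 5.0    # hypothetical systematic under-filling of group B

group_a = [true_volume + random.gauss(0, noise) for _ in range(100)]
group_b = [true_volume - bias_against_group_b + random.gauss(0, noise) for _ in range(100)]

# Individually, a fair number of group B skulls still measure larger than group A skulls...
wrong_way = sum(b > a for a, b in zip(group_a, group_b))
print(f"{wrong_way} of 100 paired comparisons go the 'wrong' way")

# ...but the random errors cancel in the averages, leaving roughly the 5 cubic-inch bias.
print(round(sum(group_a) / len(group_a) - sum(group_b) / len(group_b), 1))
```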

Lewis’s paper responds to various other claims that Gould never made, such as the suggestion that Morton intentionally used different ratios of males to females to skew his results. In reality, Gould only claimed that the unequal representation of males and females had the effect of skewing the results, not that Morton somehow did this by design.

This is followed with the assertion that, as a result of Gould, “Morton is now viewed as a canonical example of scientific misconduct.” A claim that seems odd, given how Gould was concerned only with unconscious bias and went out of his way to emphasise that he didn’t consider Morton to be guilty of any intentional misconduct. Indeed, anyone who comes away from reading Mismeasure with the impression that Morton was a fraud has not only misunderstood the relevant chapter but has somehow failed to pick up on the book’s core theme.

At this point you might be starting to think this paper is less of a rebuttal and more of a hatchet job, and, well. . . yes. If at any point you find yourself spending your summer days locked away in the basement of an Ivy League university, literally measuring skulls to prove that a 19th-century white supremacist was right all along by refuting an argument that nobody is making, then something has gone horribly wrong in your academic career.

Though for the sake of fairness I should point out that Lewis and his co-authors do manage to land a couple of valid, if nit-picky, criticisms. At one point Gould speculated that the increase in the number of skulls in the Peruvian subgroup from 23% of the total American group to 50% was responsible for a decline in the mean American skull size in between earlier and later publications. Lewis counters this conjecture by noting that for this particular calculation Morton weighted the subgroups equally, so that a change in their relative proportion could not be responsible for the decreasing mean.

Additionally Gould remarks at one point that if you remove the subgroups with a sample size of three or less from the American skulls used in Morton’s first book, not only does the difference between the races narrow considerably, you actually end up with an average American skull size one cubic inch larger than the average Caucasian skull size reported in Morton’s final catalogue. Lewis correctly points out in his critique that this is no longer true once you include all the American skulls in Morton’s final catalogue rather than just the ones used in his first book.

But neither of these minor corrections undermines Gould’s central thesis: that bias was responsible for Morton including or excluding certain subgroups in a manner that led to the results he was expecting, and that his original seed-based measurements of the skulls were similarly influenced by this bias. Not a single criticism in this paper calls Gould’s argument into question.

Still, regardless of the shortcomings of this specific paper, it is worth considering the possibility that there is something to its central accusation: that Gould himself was highly motivated to find that Morton was wrong, to find equality where it did not exist. And to this I say, yes. Of course he was.

Not only would Gould not deny this, in this very book he pointed out an example of how a bad xerox, combined with his own expectation that his recalculation of the Caucasian mean would be significantly lower than Morton’s, led to an error in the original paper on which this section of the book is based.

This humble acknowledgement of his own bias makes clear just how absurd Russell Warne’s claim is, in his Quillette article, that,

“Gould was very much like the Marxist or postmodernist who believes that invisible power structures control every aspect of life — but who must somehow show that the postmodernist is special in her ability to escape the influence of these structures just long enough to see and resist them, thanks to their extraordinary intellectual courage and perspicacity.”

Clearly Gould did not imagine himself to be uniquely immune to the effects of expectation bias, despite what his critics imagine.

Nor is it clear what precisely about Gould’s argument is supposed to be “postmodernist.” My best guess is it’s just a condition of being published in Quillette that you need to find a way to blame postmodernism or “grievance studies” at least twice. After all, I agree with Gould’s hypothesis and I’m not a postmodernist. I only read Discipline and Punish because I thought it might be kinky.

The reality is that the problem of expectation bias is widely acknowledged as something that needs to be actively guarded against. And not just in the social sciences, or by philosophers or sociologists of science, but also in cutting edge particle physics.

The Bandwagon Effect

Some scientists respond to the accusation of bias as though they’ve just been personally insulted. To say that somebody’s judgement is influenced by their expectations and that their expectations are influenced by the society in which they live is not to say that that person suffers from some character flaw or to accuse them of intentional misconduct.

It may, perhaps, be helpful to consider examples from a less politically charged area of enquiry. Richard Feynman, one of the pioneers of quantum field theory, notes in his well-known essay on “Cargo Cult Science” the example of the history of measurements of the electron charge. Robert Millikan first measured this charge in 1909, despite the technological limitations of his age, by placing charged drops of oil in an electric field and adjusting the strength of this field until the upward electric force on a droplet balanced out the downward gravitational force, allowing him to calculate the charge on each drop. By performing this experiment repeatedly he collected data on the charges of a large number of drops and noticed that the value of this charge was always a whole-number multiple of the smallest charge, which was the charge on a single electron.

Millikan’s oil drop result was simple, it was elegant, and it was wrong. His calculated result was too low by quite a wide margin.

But this isn’t a story about why Millikan was wrong. What’s interesting is what happened next. When other scientists attempted to replicate his measurements they didn’t, as you might expect, end up reporting the correct value, or end up with values randomly distributed either side of the true value. Instead each subsequent measurement was slightly greater than the one which came before it, but never so much greater as to disagree outright with the immediately previous result. This pattern continued for decades until the reported results finally stopped rising, converging on the value we now know to be correct.

Graph by John Huchra
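You can reproduce the shape of that history with a few lines of code. This is only a cartoon of the mechanism, with an invented “anchoring” weight rather than anything fitted to the real data, but it shows how a little deference to the previously published value turns a series of independent measurements into a decades-long creep toward the truth.

```python
# Cartoon of the bandwagon effect: each new report is a compromise between the
# lab's own unbiased measurement and the previously published value.
import random
random.seed(1)

true_value = 1.602     # stand-in for the true value (arbitrary units)
reported = [1.55]      # an initially low first result, a la Millikan
anchoring = 0.7        # invented weight given to the previous published value

for _ in range(15):
    own_measurement = true_value + random.gauss(0, 0.01)
    # Expectation bias: results far from the consensus get "rechecked" back toward it.
    reported.append(round(anchoring * reported[-1] + (1 - anchoring) * own_measurement, 4))

print(reported)  # a slow, nearly monotone drift toward 1.602, not a random scatter around it
```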

And this is not an isolated occurrence. The same thing happened with measurements of Hubble’s constant. Other times, instead of a smooth curve we observe a sudden jump, where expectation bias causes scientists to produce results that concur with the consensus until somebody finally manages to get a paper published that challenges it, at which point new results tend to cluster around the new consensus.

Henrion, Max & Fischhoff, Baruch. (1986)

This is precisely what happened with the speed of light.

Twice.

It also happens almost every time we discover a new particle. In fact this Bandwagon Effect, as it’s come to be known, is such a widespread phenomenon that the Particle Data Group have taken to publishing historical plots of particle properties over time in the hope of raising awareness of the problem.

We imagine that science, being such a competitive process, is self-correcting. That if one person makes a mistake others will surely have every incentive to catch it. But in reality that happens only rarely. Expectation bias is so powerful that the norm isn’t for science to self-correct, but for scientists to follow their predecessors down a dead end.

We’ve reached the point today that many physicists are starting to introduce blinds into their experiments in an effort to lessen the impact of bias.

A good example of this is Fermilab’s Muon g-2 experiment to measure the anomalous magnetic moment of the muon. In order to prevent expectations from influencing the results of their data analysis, the clock used in the experiment is set to a frequency that is then kept secret from the scientists tasked with analysing the data. When they present the results of their analysis they won’t know whether the magnetic moment they’ve calculated conforms to their expectations or not until after it has been descrambled.
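The logic of a blind analysis is simple enough to sketch. This is not Fermilab’s actual procedure, just the bare idea: a secret offset is applied to the data before the analysts see it, and is only removed once the analysis has been frozen.

```python
# Minimal sketch of a blinded analysis with a hidden offset.
import random
random.seed(7)

secret_offset = random.uniform(-0.5, 0.5)   # known only to the "blinding" step

def blind(raw_measurements):
    return [m + secret_offset for m in raw_measurements]

def analyse(measurements):
    # The analysts tune their cuts and fits on these numbers without knowing the offset,
    # so they can't steer the answer toward the value they expect.
    return sum(measurements) / len(measurements)

raw = [2.001, 1.998, 2.003, 2.000, 1.999]
blinded_result = analyse(blind(raw))

# Only after the analysis is frozen is the offset revealed and subtracted.
unblinded_result = blinded_result - secret_offset
print(round(blinded_result, 4), round(unblinded_result, 4))
```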

Fermilab even produced a Youtube video to explain why they’re doing this:

Looks like even Fermilab has been corrupted by the postmodernists.

The moral of the story? Nobody is free from bias, and those who most loudly insist otherwise are those it shackles most tightly.

A scientist’s expectations weigh heavily on their interpretation of their data. And some expectations are influenced by the prevailing social and political hierarchies of the day.

The moment in which we stop seeing objectivity as something to strive toward and start taking it for granted is the moment in which subjectivity strikes its most savage blow.

The fact that this bias is so prevalent in modern particle physics demonstrates just how unreasonable it is to suppose that it must be absent from century-old race science.

With the pervasiveness of expectation bias, even in the hardest of sciences, firmly established, let’s finally move on to the subject of this most recent criticism by Warne and his colleagues.

How Not to Measure Intelligence

On the 6th of April 1917 the United States declared war on Germany in response to the latter’s policy of unrestricted submarine warfare. On that day the US Army numbered 127,151 soldiers, more than an order of magnitude fewer than any other major power in the First World War. In the Battle of Verdun alone, the French and German armies each suffered three times more casualties than the United States had soldiers. To put it simply, if the US was serious about joining the fighting on the Western Front, they were going to need more men.

Over the next year and a half the US Army would recruit or conscript four million men. Men who needed camps, rifles, uniforms, rations, hospital beds and training. One small part of this logistical and organisational nightmare was the question of how to adequately assess the capabilities of new recruits for job assignment and the selection of officer candidates. To help them solve this problem the army commissioned Harvard psychologist Robert Yerkes to devise a new testing protocol[4].

The idea of intelligence testing wasn’t new; variants on the Binet scale had been in use for two decades at this point; but unlike the machine-graded, multiple-choice IQ tests of today, these early tests were administered in a lengthy one-on-one session by a trained specialist, who was also responsible for interpreting the examinee’s answers to the test’s often open-ended questions. Attempts at mass-marketing these tests had led to great inconsistency, making them clearly unsuitable for the kind of industrialised testing regime Yerkes had in mind for the army.

To this end he set about developing a test that could be administered to entire barracks full of new recruits at once, and quickly graded by an examiner following a manual. Because many of the new conscripts were illiterate or spoke limited English, the test came in two formats: the Alpha, a test consisting of a series of written questions that necessitated a written answer, was administered to those recruits who were literate in English, while a second test, the non-verbal Beta, was administered to those recruits who were not literate in English. Each test-taker was assigned a mental age based on their results, the forerunner of the modern notion of IQ.

Like earlier intelligence tests, the Army Alpha and Beta were intended to assess cognitive ability as independently of learned knowledge as possible. Like some but unlike others among his predecessors, Yerkes insisted that his tests accurately and reliably measured innate intelligence, and were not significantly affected by level of education received, culture, health, nutrition or any other environmental hardship.

After the war Yerkes began using the massive amounts of data he had collected as the basis for his scientific research, publishing an analysis of the data in book form. This book claimed that different races and ethnic groups had different levels of innate intelligence, concluding that white American adults had an average mental age of 13, while Black Americans had an average mental age of 10.4. The various white ethnicities could also be ranked, with British and German migrants scoring more highly than Italian or Polish ones.

In the fifth chapter of The Mismeasure of Man, Gould criticises Yerkes’ racial conclusions on two grounds. Firstly, the administration of these tests was, to use the modern term, a complete omnishambles. Test administrators complained of so many men being crammed into one room that those at the back could not hear the instructions.

Men who were illiterate in English or who failed the Alpha test were supposed to take the Beta test. But the sheer volume of test-takers put this system under pressure, with different camps adopting wildly different thresholds for the Alpha scores that would see a recruit asked to take the Beta. While your performance on the Alpha in one camp might see you immediately assigned a low grade, in another camp, less stretched for time, that same performance would see you reassigned to take the Beta, on which you might go on to receive a high grade. To make matters worse, the Beta was so oversubscribed that in many camps recruits who should have taken the Beta were sent to the Alpha instead, with the threshold for failure artificially lowered. As time went on and the logjam worsened, camps simply stopped reassigning those who failed the Alpha to sit the Beta.

The result was that many less literate men ended up with scores at or near zero. This introduced a systematic bias at the expense of Black Americans and recent migrants, who were more heavily represented among those illiterate in English.

The conditions under which the test was administered were so inadequate and so inconsistent that even had they used the least biased tests imaginable it would be impossible to draw meaningful conclusions from the data.

The second of Gould’s objections is that the Army mental tests, especially the Beta test, were so poorly designed that even had they been administered under ideal conditions they would have failed not only to measure unchanging innate intelligence, but to measure anything beyond educational experience and familiarity with (white) American culture.

The recipients of the Beta test, having completed few if any years of primary schooling, would often never even have seen a test before. It is likely that many had no experience working with a pen or pencil, meaning that those with more years of schooling likely had an advantage before a single question was asked.

Since they could not read, and many had only poor English, instructions for each section of the test were given by a mixture of voice and mime-artistry, followed by a demonstrator completing a set of example questions.

Warne uploaded a Youtube video in which he and a colleague helpfully demonstrate this practice.

Imagine you’ve never been to school, you’ve never read a book and never taken a test before. Somebody hands you a pencil, and gives you only this demonstration to let you know what you’re supposed to do. And you don’t know how much time you have to think about your answer.

You’d probably do a fair bit worse than someone with a few years of schooling who’s more familiar with this kind of abstract problem solving.

Many test sections on the Beta had more questions than could be answered within the time limit, and recruits (who were disproportionately Black or immigrant) were not told what the time limits were for each section. At one point Gould recounts an anecdote about giving the Army Beta test to his students at Harvard, noting that even these elite students were unable to finish many sections. He also notes that a surprisingly large minority of his students received a low-average mark on the test, which doesn’t support the notion that the test reliably measures intelligence.

Worse still, many of the questions seem to have a doubtful relationship with general intelligence. The Alpha, for example, expects test-takers to be able to identify Christy Mathewson as a famous baseball player, and John Adams as the second President of the United States. The Beta, meanwhile, expects test-takers to know the correct location of a US postage stamp, that light bulbs have a filament, that tennis courts have a net, that phonographs have a horn and that the jack of diamonds playing card has a big diamond in the top-left corner.

The most brazen cultural bias on the Beta comes from subtest 6, in which candidates are required to draw whatever is missing to complete the picture.

The picture above, for example, is missing strings.

While this one is missing bowling balls.

Now, let’s see if you can do some. What would you say is missing from the picture below?

The house is missing a chimney. Too easy?

Okay, how about this one?

Yeah, this one got me too. Do they want a corkscrew? A nail file? A Swiss flag? Nope. Turns out it’s a rivet.

While the bias in subtest 6 is the most obvious, the bias in subtests 4 and 5 may be all the more damning.

We are asked to believe that questions like these aren’t sensitive to degree of literacy.

These subtests required recruits to recognise and work with numerical symbols. It seems implausible that those unfamiliar with reading and writing numbers would have had a fair shot at them, and Yerkes’ own completion statistics bear this out:

Notice the local maxima at zero.

Ah yes, the infamous bimodal distribution of intelligence.

These histograms seem to bear out the idea that test takers had very different experiences depending on their familiarity with reading numbers: those who understood were able to make a meaningful attempt at the test, while the many test takers who simply had no idea what they were supposed to be doing ended up clustered around zero.
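That zero-spiked shape is exactly what you get from a mixture of two very different experiences of the same test. Here’s a toy simulation; the proportions and score distributions are invented rather than estimated from Yerkes’ tables, but the qualitative picture is the same.

```python
# Mixture model: some examinees never grasp the task (pile up at zero),
# the rest make a genuine attempt (roughly bell-shaped scores).
import random
from collections import Counter
random.seed(3)

scores = []
for _ in range(1000):
    if random.random() < 0.3:                          # hypothetical 30% who never understood
        scores.append(max(0, int(random.gauss(0.5, 1))))
    else:                                              # those who made a real attempt
        scores.append(max(0, int(random.gauss(12, 4))))

histogram = Counter(scores)
for score in range(0, 25, 4):
    print(f"{score:2d} {'#' * (histogram.get(score, 0) // 10)}")
# A tall spike at zero, a dip, then a hump around the mean of the second group:
# bimodal, much like several of the Beta subtest histograms.
```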

The Alpha test had even bigger problems…

…owing to the large numbers of recruits incorrectly assigned to sit the Alpha.

Yerkes calculated averages for subgroups based on hookworm infection and number of years of residence in the US.

Keep in mind, many of these people were never retested with the Beta. They go into Yerkes’ raw statistics as zeroes. And in the final analysis many of these recruits were artificially assigned negative scores, depending on how well they had done on other subtests.

Yerkes’ own data also showed a clear correlation between test scores and health, and for immigrants showed a clear correlation between test scores and years of residence in the United States.

Yerkes explained away such observations by suggesting that low innate intelligence may induce living conditions unfavourable to good health, and that less intelligent immigrants may fare less well in the United States and return to Europe in greater numbers.

All in all, it should be pretty clear that the Army mental tests were thoroughly biased, and that the racial conclusions based on them were nonsense.

And yet somehow Yerkes failed to notice. He saw exactly what he expected to see. Motivation can be a powerful thing.

Curiously, despite Gould’s anecdote about administering the Army Beta to his students being largely peripheral to his main argument, Warne and his colleagues’ attempt to refute Gould and defend the honour of the Army intelligence tests centres on an attempt to replicate precisely this exercise.

By administering the Army Beta to their own students at Utah Valley University, they sought to prove two points: that Gould had not made a good-faith attempt to administer the test in a fair and unbiased manner; and that the Army Beta does, in fact, measure intelligence.

We’ll see how this little experiment of theirs panned out in a moment, but before we can do that there’s a question we need to answer: what does it even mean to say that a test measures intelligence?

What the hell is Intelligence Anyway?

Is intelligence quantifiable? Is there a single unitary thing called intelligence or are there multiple intelligences? Is the concept of intelligence culturally relative? Is it an innate or an acquired characteristic? Are modern IQ tests a reliable and valid measure of intelligence?

These are just some of the many interesting scientific and philosophical questions that will not be answered in this essay.

Instead, we’ll take a brief look at the principles behind the construction of modern IQ tests with a view to determining whether the Army Beta could possibly be a measure of intelligence as it is classically understood in the context of IQ research.

We begin with a thought experiment.

Imagine we take a typical class of school kids and we give them a maths test. Some kids do well, some do poorly, most cluster in the middle. Suppose we take the top-scoring 25% and call them group A, and take the bottom-scoring 25% and call them group B. We then give both groups a test in an unrelated subject, say, history. Which group do you think is likely to score more highly on this test? A or B?

Some of you are no doubt inclined to suggest that neither group is more likely than the other to score highly on the history test. After all, there are kids who do well in history but poorly in maths and kids who do well in maths but poorly in history.

But it turns out that group A is likely to do better. To understand why, consider what is being assessed in each test. A maths test assesses maths-specific knowledge and skills, certainly, but it also measures some more general abilities that are also useful in a history test. There is some common factor measured by both tests. For this reason, tests that require some mental effort tend to be positively correlated with each other, often quite strongly so.

Still, it would not yet be sound to equate this common factor with intelligence; it may be that maths and history both rely in subtle ways on the same learned skill or article of knowledge. For this reason all intelligence tests are composed of a large number of different subtests, the quantity and varied nature of which hopefully eliminate the role of overlap between tests. The remaining common factor was taken by early 20th century psychologists, and by some modern psychologists, to be an actual measurable attribute of the human mind, which they called general intelligence — that ability of the brain to do stuff that requires mental effort, and which it is argued is independent of education or culture even if our measure of it (IQ) is not.

There’s more to the argument than this, of course, but this essay is already pretty long and I don’t want to lose half my remaining readers with a surprise lecture on factor analysis. Suffice it to say that statistical techniques support the notion that a single factor, rather than multiple factors, best explains the subtest correlations. A single factor that has traditionally been identified with intelligence.
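If you’d like the flavour of that argument without the lecture, here’s a minimal sketch. It simulates subtest scores that are each driven partly by a single shared latent quantity plus independent noise (the loadings and sample size are arbitrary), and shows that every pair of subtests then comes out positively correlated.

```python
# One latent factor plus independent noise -> all subtests positively correlated.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_subtests = 2000, 6
latent = rng.normal(size=n_people)                   # the hypothetical common factor
loadings = rng.uniform(0.4, 0.8, size=n_subtests)    # how strongly each subtest taps it
noise = rng.normal(size=(n_people, n_subtests))

scores = latent[:, None] * loadings[None, :] + noise
print(np.round(np.corrcoef(scores, rowvar=False), 2))  # every off-diagonal entry is positive
```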

I should say at this point that I don’t actually agree with the argument that this common factor reflects a real, measurable attribute of the people being tested. However, for the remainder of this essay I’ll be accepting it for the sake of argument. If you’re interested in the case against this interpretation, Gould presents an accessible argument against it in chapter 6 of The Mismeasure of Man.

Now, knowing what we now know about intelligence tests, what attributes would the Army Beta need to have in order for it to be said to measure intelligence?

Well, at a bare minimum its subtests would all need to be positively correlated with each other. But this alone isn’t enough, as we’ve seen that even subject-specific school tests correlate with each other, and those tests don’t measure intelligence per se. So we would need the positive correlations to be stronger than those between purely scholastic tests.

We would also need Army Beta scores to correlate positively with other tests, and more strongly than purely environmental factors correlate with those tests. After all, there’s a positive correlation between socio-economic status and educational achievement, yet it would be ridiculous to claim that your parents’ bank balance is a measure of intelligence.

Finally, we would need the results to be best described by a one-factor model, as otherwise the test cannot possibly be a measure of unitary intelligence.

With these conditions in mind, let’s take a look at the results of Warne’s experiment.

How Not to Test a Hypothesis

Warne’s experiment sought to test four hypotheses.

The first is that the scores achieved by his students will be higher than those reported by Gould, due to a combination of Gould having administered the Army Beta in a biased way and the Flynn effect, that is the tendency for IQ scores to increase over time.

Note, however, that even if the data supported this hypothesis, we would have no way to determine whether that was due to the Flynn effect, due to bias on Gould’s part, or some combination of the two.

The second hypothesis is that the completion rates for each section of the test will be similar to those reported by Gould. If this hypothesis were falsified and Gould’s reported completion rates were significantly lower than theirs then this would be prima facie evidence that Gould had administered the test in a biased manner to get the result he was expecting.

The third hypothesis is that results on each subtest of the Army Beta will be positively correlated with each other and with the students’ ACT scores and college GPAs, with the possible exception of subtest 1 (which they consider may be too easy to produce enough variation in the results). As we saw a moment ago, poor correlation would suggest that the test is incapable of measuring intelligence at all.

However, were the data to support this hypothesis, it wouldn’t actually contradict Gould’s own analysis. Since his argument is that the Beta measures education, literacy and cultural familiarity more than it does general intelligence, a correlation between educational attainment and the Army Beta is to be expected either way.

You know, I’m starting to think this is a badly designed experiment.

The fourth hypothesis is that a one factor model will be a good fit to the data and will fit the data better than a two factor model.

The data will support this hypothesis if a statistical test finds the one factor model to be a good fit and a better fit than the two factor model. But the data will falsify this hypothesis if the one factor model is a bad fit, if the two factor model is a better fit than the one factor model or if both models fit the data equally well.
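For anyone who has never run this kind of comparison, here is roughly what it looks like in practice. It’s a sketch rather than Warne’s exact procedure: it fits one- and two-factor models to simulated subtest scores and compares them with a likelihood-ratio chi-squared test, using the usual parameter-count difference for one extra factor.

```python
# Sketch of a one-factor vs two-factor comparison via a likelihood-ratio test.
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_people, n_subtests = 2000, 6
latent = rng.normal(size=n_people)
scores = latent[:, None] * rng.uniform(0.4, 0.8, n_subtests) + rng.normal(size=(n_people, n_subtests))

total_ll = {}
for k in (1, 2):
    fa = FactorAnalysis(n_components=k).fit(scores)
    total_ll[k] = fa.score(scores) * n_people        # score() returns the average log-likelihood

lr_stat = 2 * (total_ll[2] - total_ll[1])            # the two-factor model fits at least as well
extra_params = n_subtests - 1                        # usual df difference for one extra factor
p_value = chi2.sf(lr_stat, df=extra_params)
print(round(lr_stat, 2), round(p_value, 3))
# A small, non-significant improvement means the second factor isn't earning its keep;
# a large, significant one is evidence against the single-factor account.
```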

We saw earlier why support for a two factor model would undermine the notion that the Army Beta measured the same thing as modern IQ tests. However, the study’s authors also believe that if the data supports this hypothesis then it disproves Gould’s conjecture about the impact of some soldiers being unable to read or write numbers, since ability to work with numbers would be a second significant factor.

Did you spot the sleight of hand here? If it is true that (as Gould hypothesised) inability to read numbers was a major issue in the original test cohort, then we would expect a two factor model to fit better than a one factor model when applied to data from that original WW1 cohort.

But unless Utah Valley University routinely admits students who are completely illiterate there is no clear reason to expect a two factor model to be supported by their results.

Again, this is a very poorly designed study.

They do go on to test the fit of both models to Yerkes’ historical data, so I guess they just got confused when defining their hypotheses? (The study was pre-registered).

Either way, how did the experiment pan out?

The two hypotheses that relate to whether or not Gould fudged his results when testing his Harvard students, that is hypotheses 1 and 2, both vindicate Gould.

Warne’s students clearly performed more poorly on the Army Beta than Gould’s did, providing no evidence that Gould administered the test in a biased way. The first hypothesis is falsified.

Table from Warne et al 2019

The second hypothesis, meanwhile, is supported by the data, again providing no evidence that Gould’s result was a product of biased administration.

The last two hypotheses, which must be true if the Army Beta measured intelligence, are both falsified.

Many subtests either had no statistically significant correlation with each other or fell right on the border of statistical significance. Worse still, one of the subtests was negatively correlated with college GPA, which, as we saw earlier, is a big problem. Hence the third hypothesis fails, and whatever else the Army Beta may be said to measure, it isn’t intelligence.

The correlations found by Warne et al

For hypothesis 4, using a chi-squared test they find that the two factor model is a slightly better fit than the one factor model when applied to the data from their college students, though this difference isn’t statistically significant.

When the rival models are applied to Yerkes’ historical data they do get a statistically significant result. In favour of the two-factor model.

Once again, Gould has been vindicated.

In total only one of their hypotheses is supported by the data, and it’s the one that doesn’t contradict Gould in the slightest.

The only response that an honest researcher could have when confronted with these data is to accept that they were wrong: that their results support neither their initial belief that Gould prejudicially administered his test nor their assumption that Gould was wrong about the validity of the Army Beta.

This was not their response.

Instead Warne persisted in accusing Gould of being committed to “ideology over data”, of lying to his readers and of treating data unfairly. This is especially ironic as when Warne’s data didn’t support his preconceived notions, he set about fudging the results.

Despite the fact that every hypothesis as originally stated, with the exception of hypothesis 2, was falsified, Warne claims in his paper that only the first hypothesis had been falsified.

Hypothesis 3 asserted that ALL Army Beta subtests, the ACT results and GPAs would be positively correlated with each other, with the possible exception of subtest 1. In actuality they found multiple counter-examples. Nonetheless they claim this hypothesis was, and I quote, “mostly supported”. Using weasel words to move the goalposts when they didn’t get the results that they wanted.

Still, there were some correlations, right? What can we infer from the results that they actually got?

Well, they found the scores on the Army Beta have an r=0.38 correlation with scores on the ACT and an r=0.14 with college GPA.

But, as explained earlier, a correlation between the Army Beta and educational attainment is to be expected based on Gould’s hypothesis that the Army Beta was measuring mostly education, literacy and cultural familiarity.

Additionally, such a meagre correlation is unimpressive in light of the truism of intelligence testing that all tests tend to be positively correlated, even ones that test domain-specific knowledge like school exams as explained earlier. No reasonable person would claim that a history test is a measure of somebody’s mathematical ability, so for the Army Beta to be a measure of intelligence the correlation used to support that claim must be greater than we would expect for two unrelated scholastic tests.

I did some digging and found a paper from around the same time that the Army Beta was in use. In it, the author presents the observation that high school students’ performances on tests in various subjects are all correlated with each other, precisely as we would expect. He doesn’t present r values, as regression analysis wasn’t always done in a rigorous manner at the time, but using his raw data I calculated them for a sample of three subject pairs.
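(The calculation itself is nothing exotic. For anyone who wants to check this sort of thing themselves, it amounts to something like the snippet below, though the scores here are placeholders rather than the historical data.)

```python
# Pearson correlation between two sets of paired exam scores.
from scipy.stats import pearsonr

algebra = [72, 65, 88, 54, 91, 60, 77, 83, 69, 58]   # placeholder paired scores
civics = [70, 61, 85, 60, 88, 65, 74, 80, 72, 55]

r, p = pearsonr(algebra, civics)
print(f"r = {r:.2f}, p = {p:.3f}")
```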

The results speak for themselves. An r=0.58 correlation between algebra scores and civics scores, an r=0.48 correlation between algebra scores and American history scores, and a whopping r=0.68 correlation between algebra and Latin. All stronger correlations than between the Army Beta and measures of educational attainment.

The Army Beta is a worse measure of intelligence than an algebra test is a measure of ability to speak Latin.[5]

Not only does the Army Beta correlate with other tests worse than school tests of subject specific knowledge from the same time period, the Army Beta performs no better than many environmental predictors of educational achievement. If we are to accept that the Army Beta is a measure of innate general intelligence, then we also have to accept that your parent’s bank balance is a measure of your innate general intelligence.

To everyone besides Warne and his team, it’s pretty clear there’s something else going on here.

Hypothesis 4 is also said to be supported by the data, despite the fact that the two factor model outperformed the one factor model with both the data collected from Warne’s students and Yerkes’ historical data.

How do they justify this leap of logic? The principle of parsimony. That is, essentially, Ockham’s Razor. They argue that a one factor model should be favoured over the two factor model on the grounds that simpler models ought to be favoured over more complex models. They make this argument even for the WW1 data where the superiority of the two factor model was statistically significant.

Yeah. I wasn’t convinced either.

Sure, maybe you could argue the case that there are a priori reasons to favour a one factor model. But this would have nothing to do with the data they have collected. And it certainly doesn’t allow them to claim that said data now magically supports their initial hypothesis.

Instead of accepting that the better fit of the two-factor model to the historical data implies, as Gould predicted, that the ability to read numbers was an issue for the original Army Beta, they try to explain it away by arguing that the chi-squared test they were using is biased towards more complex models.

That’s right. The statistical test they performed on their data found that the two-factor model was a better fit than the one-factor model, and that this result was statistically significant, and then they concluded that the data actually supports the one-factor model.

Just because.

To be clear, there wasn’t some alternative statistical test they performed that gave a different result, they just ignored the result of their only hypothesis test and decided that it didn’t matter.

If the results of a chi-squared test could not be a basis for rejecting the hypothesis then why was it the only statistical test that was carried out? It’s very obvious that they were expecting the opposite result and just had to hand-wave away the result they actually got.

Amazingly they have the audacity to claim that this data supports their hypothesis.

I think it’s pretty clear who’s fudging their results here, and it isn’t Gould.

I bet they regret that pre-registration now.

Warne’s Gish Gallop

In addition to their little experiment, Warne’s team follow the lead set by Lewis in responding to Gould by completely misrepresenting his position and then attacking that position instead. For example, Gould noted at one point that even Edwin Boring, one of the psychologists responsible for the original Army Beta analysis, came to the conclusion in later life that the test couldn’t possibly measure innate intelligence. In section 3.4 of his paper Warne writes:

“Gould’s final criticism was against early psychologists’ belief that the Army Beta measured intelligence. Quoting Boring, (as reported by Kevles), Gould called the belief “preposterous”. Gould also stated that test administration conditions “… made such a thorough mockery of the claim that recruits could have been in a frame of mind to record anything about their innate abilities”, and that it was “… ludicrous to believe that [the Army] Beta measured any internal state deserving the label intelligence”.

“Kevles also recorded Boring’s opinion in the 1960’s that “… the tests had predictive value …” because they correlated with meaningful outcomes in a soldier’s army career. Yet, Gould did not communicate this information to his readers. Additionally, Boring twice published articles in scholarly journals using the Army Alpha and/or Army Beta to measure intelligence. In both articles, Boring accepts that the army tests measure intelligence.”

Clearly giving the impression to readers that Gould had misrepresented Boring’s view. In reality, though, the papers in which Boring claimed the tests measured innate intelligence were from the 1920s and 1940s, while the interview in which he is quoted as describing this belief as preposterous came much later, in the 1960s, after he had changed his mind, as Gould correctly reported. As to the point that Boring defended the notion that the tests had predictive value even in later life, this is irrelevant, as having predictive value is not the same thing as measuring innate intelligence.

Both the research paper and the Quillette article are full of little tricks like these, in which it’s just assumed that the reader doesn’t know better and won’t bother to check.

At another point he notes that the proportion of zeroes on any subtest was never more than 10%, with the implication that this is relevant. Gould made three points regarding the large number of zeroes: 1) that it suggests many of the men didn’t understand what they were supposed to do on any particular subtest; 2) that the scores for many of the subtests had peaks at zero or were bimodally distributed; and 3) that soldiers scoring zeroes had their scores adjusted in an ad hoc manner.

Warne dismisses the last of these points on the grounds that individual soldiers were not given negative scores for assignment purposes; rather, this was merely done for the purposes of the statistical analysis. A defence which seems odd, as this is precisely what Gould was objecting to.

The second objection, about the non-normal distribution of scores wasn’t mentioned at all. And in so far as Warne’s response relates to the first objection it simply amounts to insisting that 10% of men getting a zero is not “vast numbers” of men getting a zero. Based on this rather strange and entirely subjective interpretation, Warne claims in his Quillette article that this is an example of an “outright lie” by Gould.

In response to the examples of cultural bias identified by Gould, Warne simply responds by feigning incomprehension of why these examples would be culturally biased or fail to meaningfully measure intelligence.

He suggests that since rural areas have higher levels of illiteracy, the illiterate test takers would be expected to understand that the pig was missing a curly tail. Not only has he cherry-picked one of the least problematic examples here, the allegation of cultural bias didn’t relate to cultural differences between literate and illiterate pig farmers, it related to cultural differences between Black and White Americans and between those born in the US and recent migrants.

It also isn’t clear from Warne’s defence how the acquired knowledge that a pig has a curly tail is relevant to assessing general intelligence.

In response to Gould’s objection that it is unreasonable to assume illiterate men could all read numbers, Warne has this to say:

“Gould’s. . . argument. . . requires readers to believe that large numbers of Army Beta examinees were completely baffled by the use of numbers on the test, despite these being some of the most basic written symbols in the English language — and almost any language that the examinees spoke as a native. With hundreds of thousands of men taking the Army Beta, it is likely that some men were genuinely confused about the test, but the available evidence appears to suggest that Gould’s claims regarding numerical comprehension were exaggerated.”

Unfortunately he doesn’t cite any of this evidence.

This section also contains obvious non-sequiturs such as:

“Recent immigrants (who had been in the United States for less than 5 years) were only 10–12% of examinees, and it is likely that some of these men spoke English. This contradicts Gould’s statement that, “Many Beta examinees were recent immigrants who did not speak English.””

Finally, their response to Gould’s criticism of the testing conditions amounts to saying that they don’t see this as a major problem. It takes them four whole pages to say it, but that’s the gist. I’m not going to spend time responding to these points one by one because, frankly, it’s a side issue, entirely tangential to the point they’re trying to make with their experiment. Feel free to read it and judge for yourself whether their argument has any substance.

Perhaps it shouldn’t be surprising that this paper turned out to be such a dumpster fire. After all, Warne and his co-belligerents have set themselves a difficult task: they need to show that Gould was wrong to suggest that socially embedded bias is a significant problem that science must constantly guard against, while simultaneously claiming that socially embedded bias is responsible both for the allegedly incorrect findings of Gould and other 20th-century researchers who challenged scientific racism, and for the widespread acceptance and proliferation of these supposedly politically correct errors.

Gould was entirely up-front about his egalitarian political beliefs, recounting some of his activism in the book’s introduction:

“I confess, first of all, to strong feelings on this particular issue. I grew up in a family with a tradition of participation in campaigns for social justice, and I was active, as a student, in the civil rights movement at a time of great excitement and success in the early 1960s. . . I had taken part in many actions to integrate bowling alleys and skating rinks, movie theatres, restaurants, and, in particular, a Yellow Springs barber shop run by a stubborn man named Gegner (meaning “adversary” in German and therefore contributing to the symbolic value) who swore that he couldn’t cut a black man’s hair because he didn’t know how.”[6]

He confided this information because, in his own words:

“One needs to understand and acknowledge inevitable preferences in order to know their influence–so that fair treatment of data and arguments can be attained! No conceit could be worse than a belief in one’s own intrinsic objectivity, no prescription more suited to the exposure of fools. . . The best form of objectivity lies in explicitly identifying preferences so that their influence can be recognized and countermanded.”[6]

Warne apparently agrees with Gould that it’s important to put such information out there, since he quotes from it at length in his smear piece in Quillette. Curiously, though, Warne is rather less forthcoming about his own political views or any deeply held beliefs that have the potential to influence his own research.

Since he apparently agrees about the value of putting your cards on the table, shouldn’t Warne practice some of this transparency himself?

Maybe he’s an avatar of pure logic, entirely detached from worldly concerns, though I suspect he’s no more indifferent to the outcome of his research than Gould was to his own. Since he’s not forthcoming, it’s hard to be sure.

I don’t know what personal biases Russell Warne may have, and hence I cannot say whether bias or mere sloppiness is to blame for his many errors. The same, however, cannot be said of his publishers.

The Junk Science Industry

The Journal of Intelligence, in which Warne’s paper is published, is an open-access journal: authors pay to publish their results, which are then freely available to the general public. This contrasts with the conventional academic publishing model, in which the journal makes money by charging readers for access.

The ongoing shift toward open access has the potential to improve transparency and make scientific knowledge freely available to those without access to a big university library. The public foot the bill, directly or indirectly, for almost all research; they should have access to it. I say this to make clear that I don’t think open-access journals are a bad thing, provided such journals remain committed to the highest standards of peer review. There is, however, a perverse incentive that must be guarded against: if journals are paid for each paper they accept, rather than according to the number of people who want to read them, they have a clear incentive to accept more papers at the expense of quality.

Care, then, must be taken to assess what kind of open-access journal you are dealing with.

This particular journal is owned by the MDPI group and shares its name with a longer-established and better-known conventional journal (albeit one with problems of its own). It boasts of offering authors rapid publication, promising that peer review will be completed and a decision made within weeks of submission, with publication following shortly after.

How do they find reviewers willing to complete such a task so quickly and at such short notice? Well, one way is by offering prospective reviewers a discount on the publication fee for the next paper they submit to an MDPI journal.

This creates a clear financial incentive for reviewers to agree to assess articles they may not be well qualified to judge, or to take on reviews without putting any real work in.

This business model produces the kinds of results you would expect.

A few years ago another journal in the MDPI group published a paper titled Theory of the Origin, Evolution and Nature of Life, which posited that everything from subatomic particles to DNA strands to planets and galaxies is made up of gyres. A gyre apparently being a helix, or a spiral, or a circle, or, uh, I dunno. I think it has to be round, and apparently it’s the basis of this guy’s own personal theory of everything.

The reviewers apparently assessed all 105 pages and 800 references of this paper, whose subject matter spans biochemistry, evolutionary biology, quantum gravity and astrophysics, in less than a month. Gotta get that discount voucher for timely peer-review, eh guys?

Another of their sister journals published a hoax paper claiming that use of the herbicide glyphosate (brand name Roundup) causes cancer.

And autism.

What is it with these cranks and autism?

And that’s just a couple of famous examples.

The editors of the journal Nutrients, which has one of the highest impact factors of any MDPI journal, resigned en masse last year, accusing MDPI of pressuring them to accept lower-quality papers.

Far from standing above society, untouched by worldly concerns, the scientific process is so deeply embedded in society that even something as mundane as the profit-motive can corrupt it if left unchecked.

But the profit motive doesn’t create a bias toward one conclusion or another; it simply lowers the bar on quality. If that were the whole story, this paper wouldn’t have been written up in Quillette and tweeted out by high-profile provocateurs.

Despite only being published in February, Warne’s paper already had more views than all but one of the papers published in the same journal last year. It owes this success in large part to courting controversy and to the publicity it has received in Quillette.

It is, in effect, clickbait.

Academic clickbait.

The attention that has been generated will doubtless see hereditarian dinosaurs join Gould’s least principled critics in positively citing the paper, while others will write responses pointing out all of its flaws.

Many numerical measures of an academic’s success, such as citation counts and the h-index, are not sensitive to the quality of the citing paper or the favourability of the citation.
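For anyone unfamiliar with the metric, here is a minimal sketch of how the h-index is computed (the numbers are invented): an author has index h if h of their papers each have at least h citations. Nothing in the calculation asks whether the citing paper endorses or demolishes the cited work.

```python
# A minimal sketch of the h-index calculation (hypothetical numbers).
# The point: the metric counts citations regardless of whether the citing
# paper praises, refutes, or merely mentions the work.

def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

papers = [12, 9, 7, 4, 1]   # citations per paper, favourable or hostile alike
print(h_index(papers))       # -> 4
```

A scathing rebuttal and a glowing endorsement move the counter by exactly the same amount.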

The fact that university press offices are obsessed with attention-grabbing headlines and are paying ever more attention to social-media impact only adds to the incentive to court controversy.

Meanwhile, what does Quillette get out of this?

Quillette is an online magazine that seems to be preoccupied with a few recurring themes.

Of particular relevance to Warne’s article, the magazine has a long-running obsession with asserting the reality of hard-wired racial differences in intelligence. Frankly, it’s the kind of magazine that, were Stephen Jay Gould still alive, would probably find itself the subject of a new chapter in any future edition of The Mismeasure of Man.

Quillette describes itself as a platform for free thought. Specifically, it seems, thought that is free from the constraints of evidence and sound inference.

Quillette loves to publish articles that cite new science, no matter how tangentially relevant or low-quality; doing so makes it look as though there’s more substance behind its poorly researched blog posts than there actually is. It’s the opinion-blogging equivalent of an advertiser citing a “9 out of 10” study.

It’s the perfect little symbiotic relationship for bargain-bin academics whose ideological outlook aligns with Quillette’s editorial slant.

And it serves as a nice little demonstration of the social embeddedness of science. All the studies and response papers that will likely one day reference this one, whether positively or critically, would either not exist, or would exist in a very different form, were it not for the ability of Warne et al. to pander to a particular political constituency.

This is an example, in microcosm, of how ideological forces can amplify the propagation of a particular theory and prime its reception, and of how socially determined bias can compromise science’s ability to self-correct. The survival of theories in the scientific marketplace of ideas depends not only on the quality of evidence and mathematical rigour, but also on societal whim.

Just as Gould argued. Science doesn’t stand above society, pure and incorruptible. Real science, unlike the idealised science fetishised by so many Quillette readers, is carried out by humans, and as such prone to error.

And science is never so vulnerable to corruption as when we assume it immune.

Footnotes

[1] There were earlier criticisms of Mismeasure originating from the usual suspects, mostly based on defending the hereditarian interpretation of IQ. However two more recent papers, both presenting new empirical results, are among the most widely cited at present by those who wish to claim that Gould has been discredited. These are the studies which this essay focuses on debunking.

[2] See Kaplan 2015, Weisberg 2014 & 2016.

[3] What follows is a brief summary of Gould’s critique of Morton, see Ch. 2 of The Mismeasure of Man and Gould 1978. Only the key points are summarised and less consequential criticisms are omitted.

[4] What follows is a brief summary of Gould’s critique of Yerkes and his collaborators, see Ch. 5 of The Mismeasure of Man. Only the key points are summarised and less consequential criticisms are omitted.

[5] For students in at least some early 20th century US high schools.

[6] From the Introduction to the Revised Edition
