Cargo cult statistics
The failure of scientists, and particularly students in the sciences, to properly understand the most commonly used statistical concepts in their field has been extensively documented (see e.g. Sotos et al., 2007). Many of the specific misunderstandings are by now well known to anyone with an interest in methodology – for example, Haller and Krauss (2002) surveyed students and faculty at several universities and found that the overwhelming majority could not correctly define a p-value, a crisis that has launched a thousand publications.
I’m not interested in these misunderstandings, because I don’t consider them to be very important. The failure of students and researchers to understand (say) basic hypothesis testing has sustained a small cottage industry and several tenured careers spent attempting either to correct the specific misunderstanding, or to do away with null hypothesis testing altogether and replace it with something like a Bayes factor. All of these efforts are doomed to fail because they ignore the fundamental problem: these misunderstandings are not isolated gaps in knowledge; they reflect a wholesale lack of understanding of basic concepts in statistics and probability.
If we want to understand how these misunderstandings come about, thinking about how students learn statistics can be instructive.
For those who might be unfamiliar with the way that statistics is taught in the natural or behavioural sciences: undergraduate students typically take a department-mandated course (or two) in data analysis sometime in their second or third year. Unless their major is highly quantitative, most students do not have a strong background in mathematics. Many will not have taken any math since high school, while others may have taken a required course in “Calculus for non-math majors”. As a result, the course is largely taught through pictures and analogy. Generally, instructors will reassure their students on the first day of class that, although there will be “formulas”, the main goal is for them to understand the “meaning” of the equations, and so gain some form of “intuition” about how to use statistics.
Intuition, as I understand it, is something that results from a deep understanding of a field or concept, so that a person understands fundamentally how it behaves. How a student can gain an intuition for statistics without solving any statistical problems, or grappling with the central theorems of the field, is beyond me. Intuition comes after technical proficiency; it is not a replacement for it. This doesn’t mean that every student needs a background in measure-theoretic probability, but shouldn’t a researcher who is going to work with statistical models as part of their career compute the expected value of a distribution at least once in their life? I’ve floated this idea to colleagues before, and most of them are strongly opposed to it. They generally respond with some variant of “I don’t need to know the math behind everything, I know it well enough to use it in practice”. I’ve even heard colleagues claim that students shouldn’t take courses in the statistics or mathematics departments because “those courses don’t teach them how to use statistics!”. And yet, here we are. Over 90% of psychology undergraduates cannot correctly interpret a hypothesis test. Apparently they didn’t get much intuition after all.
Some courses may provide an introduction to probability – necessarily discrete, since the instructor can’t assume a background in calculus. This will consist mostly of rules for when to multiply or add probabilities, and a few examples of Bayes’ theorem. None of this will ever be used or discussed again, since almost all statistical models used in practice are continuous.
The course then moves on to common summary statistics. This is generally taught by giving formulas for the mean, median, standard deviation, and variance, and explaining that the median should be used if you plot a histogram and it looks asymmetrical. A typical exam will here present a skewed histogram with the question “Which measure of central tendency would you compute from this data, the mean or the median?” This is exactly the kind of question that encourages the mindless parroting of trivia. Sure, the mean will be influenced by the skew in the data, but the sample mean is also – ya know – an unbiased estimator of the population mean. The median is not (in general). What if I’m trying to estimate the population mean? What if I want to use my estimate in some bigger model? Is the distribution of the median consistent with my model’s assumptions? Or maybe I just want a robust estimate of the center of my data. Different goals require different measures. Sure, these considerations are beyond the scope of two-weeks-into-an-intro-statistics-course, but talking with graduates of these courses never gives the impression that these complexities have been addressed at all. To students, the mean is the formula you use when your data are normalish. Otherwise, you use the other formula.
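To make the contrast concrete, here is a minimal simulation sketch (my own illustration, assuming numpy and a lognormal population chosen purely for convenience): with skewed data the sample mean still averages out to the population mean, while the sample median settles on a different quantity altogether.

```python
# Illustrative sketch: mean vs. median as estimators of the population mean
# when the data are skewed. The lognormal population and the sample size are
# arbitrary choices for demonstration.
import numpy as np

rng = np.random.default_rng(0)
true_mean = np.exp(0.5)              # population mean of a lognormal(0, 1)
n, reps = 30, 20_000

means = np.empty(reps)
medians = np.empty(reps)
for i in range(reps):
    x = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # one skewed sample
    means[i] = x.mean()
    medians[i] = np.median(x)

print(f"true population mean:  {true_mean:.3f}")
print(f"average sample mean:   {means.mean():.3f}")    # close to the true mean
print(f"average sample median: {medians.mean():.3f}")  # noticeably smaller
```

Neither estimator is “right”; they answer different questions, which is exactly the point the exam question never asks.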
All courses must inevitably introduce hypothesis testing, which requires introducing the concept of a sampling distribution. This is hopeless. I am not aware of any literature actually quantifying student misunderstandings in this area, but I have never in my life (and I have tried very hard) encountered a student, textbook, or instructor in the behavioural sciences with a correct, or even vaguely correct, understanding of the concept of a sampling distribution. How could it be otherwise? Consider the definition, and the depth of knowledge that it actually assumes:
A sampling distribution is the distribution of a statistic.
Now, approach this definition from the perspective of a student with no formal training in probability or statistics. Unpacking:
- “A sampling distribution”: Alright, so it’s some special kind of distribution, or maybe something different from a distribution? Why does it have a special name? In reality, a sampling distribution is just a distribution. We just give it a special name when the random variable under consideration happens to have been computed from the data.
- “is the distribution”: The student at this point has likely never seen a density function, let alone manipulated one. Distributions are abstract concepts that basically mean that the histogram of their data will have a particular shape to it. For example, data are normally distributed if the histogram is bell shaped. The instructor says something about a mean, but I only have one mean! How can it be bell shaped? Apparently if I did a bunch of experiments it would be bell shaped, but I only did one.
- “of a statistic”: A statistic is a function of the data, but this is simply never defined, or even alluded to. The sampling distribution is discussed only with reference to the mean. In fact, if you ask most students, they will reflexively identify the sampling distribution concept with the mean of a sample. The idea that, say, a median or any arbitrary function of the data has a sampling distribution is foreign to them. Many of them will define a sampling distribution as some variant of “the thing your mean comes from in a hypothesis test”.
For obvious reasons, this definition is never provided. Instead, the instructor will conduct a thought experiment in which we hypothetically collect data many times and plot the mean of each sample. This will, the central limit theorem assures us, look bell shaped. Researchers who have been immersed in statistics for most of their working lives might not appreciate just how abstract this “drawing many hypothetical samples” business really is, but to a student with no technical background, a vague, hand-wavy, hypothetical repeated experiment leaves them hopelessly in the dark. There’s nothing for them to grapple or reason with. Certainly, they can’t use it to extrapolate beyond what they’ve been taught.
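The thought experiment is far easier to grasp once it is made literal. Below is a hedged sketch (my own, assuming numpy and an arbitrary exponential population) that simply runs the “hypothetical repeated experiment” and collects a few different statistics, to emphasize that every one of them has a sampling distribution, not just the mean.

```python
# The "repeat the experiment many times" thought experiment, made literal.
# Every statistic computed from the sample has its own distribution.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 25, 10_000

sample_means = np.empty(reps)
sample_medians = np.empty(reps)
sample_maxima = np.empty(reps)
for i in range(reps):
    x = rng.exponential(scale=1.0, size=n)   # one hypothetical experiment
    sample_means[i] = x.mean()               # a statistic...
    sample_medians[i] = np.median(x)         # ...another statistic...
    sample_maxima[i] = x.max()               # ...and another

# Each array holds draws from the sampling distribution of that statistic.
# The mean's is roughly bell shaped (central limit theorem); the maximum's
# is not, but it is every bit as much a sampling distribution.
for name, s in [("mean", sample_means), ("median", sample_medians), ("max", sample_maxima)]:
    print(f"{name:>6}: centre {s.mean():.2f}, spread {s.std():.2f}")
```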
Next, the instructor introduces a deeply important distinction. Up until now, in their ignorant savagery, the student has been dividing their sum of squared deviations from the mean by the sample size. This is called the sample variance, and it simply will not do for inference. For this, they must use the sum of squared deviations from the mean divided by the sample size minus one. This is called the population variance. This is perfectly sensible; after all, we’re interested in the population, so why not use the population variance? But what’s the difference? Well, one is the variance of the sample, and the other is the variance of the population. There will be mumbling about “degrees of freedom”, and how they lost one.
So, I computed the population variance of my data and it was 2.54. Does that mean it came from a population with variance 2.54? Well, no. The population variance is an estimate of the population variance. But does that mean the sample variance isn’t? Here, the student is told that the sample variance is biased. It tends to underestimate the population variance, while the population variance doesn’t. Logical, but the population variance will also be further away from the true variance, on average – it has a larger mean squared error than the sample variance. And what about standard deviation? The square root of the population variance is a biased estimate of the true standard deviation, so why do we prefer it over the square root of the sample variance? Of course, the exact mean squared error comparisons here assume that our data are normally distributed; if not, all bets are off. And all of these properties are moot when the sample size is reasonably large, since any minor difference between dividing by the sample size and dividing by the sample size minus one will be dwarfed by the measurement error and statistical uncertainty in the data. Sometimes, when I’m doing statistics in public, I like to use the sum of squared deviations from the mean divided by the sample size plus one, just to see people squirm.
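Since the claims about bias and mean squared error are easy to check by simulation, here is a short sketch (my own, assuming numpy; the sample size and true variance are made-up numbers) comparing the three divisors on normally distributed data.

```python
# Compare three variance estimators on normal data: divide the sum of squared
# deviations by n-1 (unbiased), by n, or by n+1 (smallest mean squared error
# for normal data). All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0
n, reps = 10, 200_000

x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)  # sum of squared deviations

for divisor, label in [(n - 1, "n - 1"), (n, "n"), (n + 1, "n + 1")]:
    est = ss / divisor
    bias = est.mean() - true_var
    mse = ((est - true_var) ** 2).mean()
    print(f"divide by {label:>5}: bias = {bias:+.3f}, MSE = {mse:.3f}")
```

On a run like this, the n − 1 divisor is the only one with bias near zero, and also the one with the largest mean squared error – dividing by n + 1 wins on that criterion, squirming notwithstanding.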
The student is not equipped to understand any of this discussion because the student has never been introduced to the concept of an estimator. The fundamental problem of statistical inference is that our data have been generated by some process, and we want to compute something, anything, from the data that will give us knowledge of that process. If the process has a variance, we want to compute something that we hope will be reasonably close to that variance. We could conceivably compute anything – we just have to decide what properties we want our computation to have. We could want it to be right on average (unbiasedness), consistent across datasets (low variance), as close as possible to the true value on average (low mean squared error), or robust to errors or contaminants in the data. We can choose whatever we want. But the student has never been introduced to the idea of estimation – they’ve just been handed the concept of the “population variance” and told that it’s the thing they use when they’re trying to “generalize to the population”. In the future, as researchers, they won’t fully understand why they divide by the sample size minus one; they’ll just know that it’s the proper thing to do.
There is a pithy way to summarize all this. Richard Feynman said the following in his 1974 commencement address at Caltech:
In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they’ve arranged to imitate things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas—he’s the controller—and they wait for the airplanes to land. They’re doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn’t work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.
Cargo cults occur when a group imitates the aesthetics and rituals of a discipline without understanding the reasons behind them. Cargo cult statistics happens when students and researchers misinterpret the fuzzy conceptual descriptions of statistical procedures they learned as undergraduates as being literally or universally true. The usual result is that a rule of thumb or layman’s description morphs into a concrete statistical rule.
“We usually divide by the sample size minus one when estimating a normal variance because it’s unbiased” becomes “You only divide by the sample size when you want to summarize your data. If you want to generalize to a population, you have to subtract one”. That many valid estimators of a normal variance exist, and that any one of them can be selected depending on the properties desired by the researcher, is lost on the student.
“A main effect in a regression model describes the predicted change in the dependent variable per unit change in the corresponding predictor when all other predictors are held constant” becomes “A main effect in a regression model is the effect of a predictor controlling for the other variables”. This leads to the dangerous idea that you can somehow “control” for a variable by including it in a regression model. In reality, the first line of control is always the data collection process. Regression models assume very specific functional relationships between the predictors and the dependent variable, and the distinction between “controlling” and “holding constant” matters.
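A small simulation makes the danger tangible. The sketch below (my own, assuming numpy; the variables and functional form are invented for illustration) has a confounder z that drives both x and y through z², and no true effect of x on y. “Controlling for” z by adding it linearly to the regression removes essentially none of the confounding; only adjusting for the correct functional form does, and in practice we rarely know what that form is.

```python
# Including a variable in a regression is not the same as controlling for it.
# Here z confounds x and y through z**2, so a linear adjustment for z fails.
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
z = rng.normal(size=n)
x = z**2 + rng.normal(size=n)   # x is driven by z**2
y = z**2 + rng.normal(size=n)   # y is driven by z**2; x has NO effect on y

def coef_on_x(design, y):
    """Ordinary least squares; return the coefficient on x (second column)."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

ones = np.ones(n)
naive   = coef_on_x(np.column_stack([ones, x]), y)        # no adjustment
linear  = coef_on_x(np.column_stack([ones, x, z]), y)     # "controlling for" z
correct = coef_on_x(np.column_stack([ones, x, z**2]), y)  # the true functional form

print(f"no adjustment:      {naive:.2f}")    # spurious effect, roughly 0.67
print(f"adjusting for z:    {linear:.2f}")   # still roughly 0.67 - nothing was controlled
print(f"adjusting for z**2: {correct:.2f}")  # near zero, but only because we knew the form
```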
“A result this extreme probably wouldn’t occur under the null hypothesis” becomes “the null hypothesis is probably false”. So much has been written about this misunderstanding that it needs no further discussion.
General statistical concepts become conflated with one of their specific incarnations. Factor analysis assumes multivariate normality and is often estimated by maximum likelihood. I have been challenged several times by students who have studied factor analysis for using maximum likelihood to estimate other statistical models. You can’t use maximum likelihood to estimate a binomial probability, I have been told, because maximum likelihood assumes that the data are normal, and coin flips are not normal. Similarly, to these students “least squares” and “linear regression” are synonymous. There is no distinction between a statistical model and the method used to estimate its parameters.
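For what it is worth, maximum likelihood for a binomial probability takes only a few lines and involves no normal distribution anywhere. The sketch below (my own; the flip counts are made up) maximizes the binomial log-likelihood over a grid and recovers the familiar k/n.

```python
# Maximum likelihood for a binomial probability: no normality required.
import numpy as np

k, n = 7, 20                                  # say, 7 heads in 20 flips (illustrative)
p_grid = np.linspace(0.001, 0.999, 9_999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)  # binomial log-likelihood, up to a constant

p_hat = p_grid[np.argmax(log_lik)]
print(f"grid-search MLE: {p_hat:.3f}")        # ~0.350
print(f"closed form k/n: {k / n:.3f}")        # 0.350
```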
The word “population” is used with an almost mythical vagueness, and the ability to generalize to it is possessed by only the most sophisticated of models. Consider the concepts of “fixed” and “random” effects, which are given incomprehensible explanations like “An effect is fixed if you have specified all possible levels of the effect, and random if the effects were randomly sampled from a population, or if some levels of the effects are unknown”. This is inevitably followed by “fixed effect models don’t let you generalize to the population”. This is statistical gibberish, and is symptomatic of a general failure of universities to teach statistical modelling. Students do not understand that they can hypothesize any relationships they want between their variables, and formalize those relationships in the form of a statistical model. Instead, statistics is a set of procedures. If you have groups, you do an ANOVA. If your variables are continuous, you do a regression. If you want to say something about the “population”, you have to use a specific kind of regression called random effects. Fixed effects doesn’t have the “generalize to the population” property. Of course, there is nothing built into either class of model that suggests such a thing. One type of model assumes that some coefficients have a particular distributional structure, whereas the other does not. Which one the researcher uses depends entirely on what they believe to be true about their data.
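The actual distinction is easy to state once models are thought of as assumptions about how the data were generated. The sketch below (my own, assuming numpy; the group counts and coefficient values are invented) generates data under both assumptions: in one, the group intercepts are simply free parameters; in the other, they are modelled as draws from a distribution. That is the whole difference – nothing in either construction grants or withholds a licence to “generalize to the population”.

```python
# "Fixed" vs "random" effects as data-generating assumptions: the only
# difference is whether the group-level intercepts are free parameters or are
# themselves modelled as draws from a distribution.
import numpy as np

rng = np.random.default_rng(4)
n_groups, n_per_group = 6, 40

# "Fixed effects": each group has its own intercept, with no assumed structure.
fixed_intercepts = np.array([1.0, -0.5, 2.0, 0.3, -1.2, 0.8])

# "Random effects": the group intercepts are draws from a normal distribution
# with its own mean and standard deviation (which we would also estimate).
random_intercepts = rng.normal(loc=0.5, scale=1.0, size=n_groups)

def simulate(intercepts):
    """y for each group is its intercept plus observation-level noise."""
    return np.concatenate([
        a + rng.normal(scale=0.5, size=n_per_group) for a in intercepts
    ])

y_fixed, y_random = simulate(fixed_intercepts), simulate(random_intercepts)
print(y_fixed.shape, y_random.shape)  # the data look the same either way; only
                                      # the assumed structure on the intercepts differs
```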
The sciences are filled with this kind of “folk wisdom” that has no real statistical or mathematical basis, but is widely believed and taught simply because it was the easiest thing for the student to grapple with when faced with the otherwise overwhelming vagueness and abstraction that is statistics without rigor or technical background. The student certainly can’t be expected to reason with, or derive any concrete knowledge from, “doing hypothetical experiments lots of times to get a normal distribution”, and so their understanding crystallizes into a set of steps and guidelines that they vaguely remember without remembering why, like “divide by the sample size minus one so that I can generalize to the population”. Many university departments have not had any contact with a statistics or mathematics department in decades, and so students are taught by former students of former students who themselves have the same vague understanding.
As a result of all this, if you ask a student or researcher to provide a correct definition of the p-value, they cannot. The solution to this problem, say hundreds of articles in Nature or Psychological Science or the Journal of Personality and Social Psychology, is to provide them with the correct definition. But this is plainly useless – a researcher with a flawed understanding of one of the most widely used and fundamental concepts in applied statistics will not be made whole by handing them the proper definition of a p-value. And the solution is certainly not to provide them with even more dangerous machinery, like a Bayes factor or a posterior distribution. The whole foundation has to be removed and replaced.
References
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research, 7(1), 1-20.
Sotos, A. E. C., Vanhoof, S., Van den Noortgate, W., & Onghena, P. (2007). Students’ misconceptions of statistical inference: A review of the empirical evidence from research on statistics education. Educational Research Review, 2(2), 98-113.