In a room sit three textbooks…
I’ve argued before that poor statistics education in the sciences is a major contributing factor to scientists’ general lack of understanding of basic statistical concepts. I’m not so much interested in factual inaccuracies, which are inevitable given that the material has to be presented to a largely non-technical audience, as in how textbooks go about translating statistical concepts for that audience.
Most results in statistics, even very basic ones, have extremely precise statements that require a rigorous understanding of concepts like a random variable or a density function, which usually demand mathematical background beyond the scope of statistics courses in the sciences. This means that these concepts can’t be defined exactly correctly, but have to be given definitions which are close enough to true while still being understandable to students. This isn’t inherently negative – many practitioners of statistics need to develop a working knowledge of a concept or technique without understanding it rigorously – but simplification is a subtle art. For example, how do you simplify a definition in a way that doesn’t cause problems much later, when some subtle part of the definition which was simplified away actually becomes important? What about edge cases where the simplification doesn’t quite hold true? And how do you ensure that students understand that some simplification has occurred, and aren’t tricked into thinking that they understand a concept rigorously?
Take the central limit theorem (CLT) as an illustration.
Central limit theorem Let \(X_1,\dots,X_n\) be a random sample from a distribution with finite mean \(\mu\) and variance \(\sigma^2\). Then, as \(n \rightarrow \infty\), the following holds: \[ \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathrm{N}(0, \sigma^2) \] where \(\xrightarrow{d}\) denotes convergence in distribution.
There is a lot to unpack in this definition – convergence in distribution, finiteness of the mean and variance (what does it mean for a distribution to lack a mean or variance?), and the general fuzziness with which distributions and random variables are defined in many statistics courses. This means that, in the sciences, the CLT almost always has to be summarized, rather than stated (much less proved) rigorously. The goal is to give the student some kind of intuition along the lines of “if our data come from some well enough behaved distribution, then the sample mean becomes more and more normal as the sample size increases”, while introducing a minimum of false intuition or other weirdness alongside it. Sometimes this requires saying things that aren’t exactly true, but that give the right intuition, or that are true broadly enough that the student is unlikely to run into trouble. It might also involve neglecting to mention things that are true, if they add needless complexity. I thought it might be interesting to look at a few of the statistics textbooks I’ve seen used in psychology and neuroscience departments, to see how they handle the problem.
A great example of this process done well is found in Navarro’s Learning statistics with R:
When talking about probability distributions, we’ve made use of the idea that observations X are sampled from some distribution […] This idea can be extended, to talk about the distribution of a sample statistic […] we should be able to talk about \(\bar{X}\) as having a distribution. Learning statistics with R (Section 10.3, p. 300)
The law of large numbers is a very powerful tool, but we can actually make a stronger statement than this, using one of the most useful mathematical results in all of statistics: the central limit theorem. Let’s suppose that the population distribution of \(X\) has a mean \(\mu\) and standard deviation \(\sigma\), but it’s not a normal distribution. The central limit theorem says that, as the sample size increases, the sampling distribution of the mean \(\bar{X}\) will become a normal distribution, with mean \(\mu\) and standard deviation \(\sigma / \sqrt{N}\). By convention, the standard deviation of a sampling distribution is called the standard error (SE). Therefore, the standard error of the mean \(\bar{X}\), can be described by the following equation \[ \mathrm{se}(\bar{X}) = \frac{\sigma}{\sqrt{N}} \] Learning statistics with R (Section 10.4, p. 301)
This is about the best overview of the CLT you’ll find in an introductory statistics textbook. Note that some details of the theorem aren’t mentioned – for example, the assumption that \(X\) has finite first and second moments, which is probably just a distraction at this level. It’s also a bit loose with the actual statement of the theorem, stating that \(\bar{X}\) becomes normal, which a student could easily take to mean that normality is achieved for some finite \(N\). But Navarro then goes on to head off these misunderstandings by performing simulations, so that the student can actually see what happens to the distribution of \(\bar{X}\) as \(n\) increases.
One thing I really like is that the book communicates the gist of the theorem without trying to distill it down to a set of concrete rules (e.g. if the sample size is greater than 30, then we can assume that the distribution is normal), which almost always has the effect of tricking the student into thinking that their loose intuition is more rigorous than it is, simply because concrete rules feel more rigorous. Instead, the book runs simulations from a variety of distributions and explicitly notes the differences in convergence, showing that 1) the behavior of the CLT is context dependent, and 2) the CLT gives an approximation, and we have to choose how much approximation error we are willing to tolerate in practice. Some information is lost in the simplification, but the student comes away understanding that there are subtleties which were not discussed, and is not tricked into believing that they understand the theorem fully.
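To give a flavour of what those simulations look like (this is my own Python sketch, not the book’s R code), here’s the same idea in miniature: draw thousands of samples from a symmetric and a skewed population and watch how skewed the distribution of the sample mean remains as \(n\) grows.

```python
# Not the book's code (the book uses R): a minimal Python sketch of the kind
# of simulation Navarro runs. For a symmetric and a skewed population, it
# tracks how skewed the sampling distribution of the mean remains as n grows.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
reps = 20_000  # simulated samples per (population, n) pair

populations = {
    "uniform": lambda size: rng.uniform(0.0, 1.0, size=size),
    "exponential": lambda size: rng.exponential(1.0, size=size),
}

for name, draw in populations.items():
    for n in (2, 5, 30):
        # one sample mean per simulated sample of size n
        means = draw((reps, n)).mean(axis=1)
        print(f"{name:>11}, n = {n:>2}: skewness of sample means = {skew(means):+.3f}")
```

The uniform case is essentially symmetric from the start, while the exponential case takes noticeably larger samples to settle down, which is exactly the context dependence the book is careful to point out.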
The book does all of this without being condescending – just noting that the full details of the theorem are too complex to be taught in an introductory course, and then moving on with an illustration of what the theorem entails in practice. Many statistics textbooks fall into the trap of assuming that students despise the material and trying to placate them with assurances like “now don’t worry, we won’t make you do any math” or “some super smart people worked out the details, but you only have to worry about X”. I think these kinds of statements are intended to make the material more relatable, but really they just come across as insulting.
Consider this next example, from Mayer’s Introduction to statistics and SPSS in psychology, in many ways the anti-Navarro:
In statistics, central limit theorem states that the mean of the sampling distribution equals the mean of the population and that the standard error of the mean equals the standard deviation of the population. … So long as distributions are relatively normal, we can use the principles of central limit theorem to make inferences about probability and statistical significance with relatively small samples. But how small is small? In general terms, a sample of 30 or more will probably suffice. However, we can be more precise if we make the effort to find out more about the distribution of our sample. According to central limit theorem, the sample is large enough if any of the following holds true:
Where the sample size is 15 or less: (a) the distribution must be normally distributed, (b) have no outliers and (c) must be unimodal (have one peak in the curve).
Where the sample size is between 16 and 40: (a) the distribution must be no more than moderately skewed, (b) have no outliers and (c) must be unimodal.
Where the sample size is greater than 40: (a) the distribution must have no outliers. Introduction to statistics and SPSS in psychology (2013; Ch. 4, p. 77)
Where to begin. First, the definition is simply wrong. That is not the statement of the central limit theorem, and it’s not even true. The standard error of the mean is most certainly not equal to the standard deviation of the population in general. What’s even more bizarre is that the book knows this – having defined the standard error only pages earlier. Neither does this discussion build any intuition for how the CLT works. Whereas Navarro carefully develops the concept of a sampling distribution, explaining that because the sample is random, anything we compute from the sample is also random, and thus has a distribution, and then shows exactly what this entails through simulation, Mayer gives a confusing attempt at a definition:
To get a better representation of that population, we could collect data from many samples. This task would be onerous, so we can use statistics to ‘model’ those theoretical samples. We call this a sampling distribution. Had we actually collected all possible samples, each sample would have a different mean and standard deviation. In a sampling distribution, we assume that the mean is the same as it is in the entire population, so long as that population is normally distributed. Introduction to statistics and SPSS in psychology (2013; Ch. 4, p. 73)
which offers no intuition for what a sampling distribution is. It also includes the bizarre statement “In a sampling distribution, we assume that the mean is the same as it is in the entire population, so long as that population is normally distributed”, which I assume is alluding to the fact that the expected value of the sample mean equals the mean of the population, but which is phrased so strangely that I can’t imagine a student being able to parse it, given that “sampling distribution” hasn’t even been defined beyond the claim that we somehow use it to model “theoretical samples”. Without making clear that the mean actually has a distribution, the book can’t hope to develop any intuitive understanding of the CLT, so (after giving an incorrect statement of the theorem) it goes on to list several rules. “According to the central limit theorem, the sample is large enough” if it has more than 40 observations, but we can get away with 15 if the sample is normal. But then why do we need the CLT? If the sample is normally distributed, the mean is already normal, so we don’t need the CLT at all. Of course, the CLT doesn’t actually provide any of these “rules”. Whereas Navarro is honest that its definition is not rigorous, and instead tries to develop intuition by showing how the CLT works in practice, Mayer gives no intuition at all, and lies to the student by giving a definition and a set of rules that appear to be rigorous, but are actually false.
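To spell that last point out: if the population is itself normal, the sample mean is exactly normal at every sample size, so no limiting argument (and hence no CLT) is needed: \[ X_1,\dots,X_n \overset{\text{iid}}{\sim} \mathrm{N}(\mu, \sigma^2) \quad\Longrightarrow\quad \bar{X}_n \sim \mathrm{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for every } n. \]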
Another common textbook, Gravetter’s Statistics for the Behavioral Sciences, does a little better:
[A] mathematical proposition known as the central limit theorem provides a precise description of the distribution that would be obtained if you selected every possible sample, calculated every sample mean, and constructed the distribution of the sample mean. This important and useful theorem serves as a cornerstone for much of inferential statistics. Following is the essence of the theorem
Central limit theorem: For any population with mean \(\mu\) and standard deviation \(\sigma\), the distribution of sample means for sample size \(n\) will have a mean of \(\mu\) and a standard deviation of \(\sigma/\sqrt{n}\) and will approach a normal distribution as \(n\) approaches infinity.
The value of this theorem comes from two simple facts. First, it describes the distribution of sample means for any population, no matter what shape, mean, or standard deviation. Second, the distribution of sample means “approaches” a normal distribution very rapidly. By the time the sample size reaches \(n = 30\), the distribution is almost perfectly normal. Statistics for the Behavioral Sciences (9th ed. Ch. 7, p. 205)
This is…eh. It at least attempts to describe the concept of a sampling distribution, and the statement of the CLT is true-ish, though not nearly as “usefully wrong” as Navarro, which made it clear when it was deviating from the rigorous statement of the theorem, and generally did so only when necessary to avoid confusion. Instead, Gravetter confidently states things like “By the time the sample size reaches \(n = 30\), the distribution is almost perfectly normal”, which is both not true and not helpful.
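The \(n = 30\) claim is easy to check. Here’s a quick Python simulation (mine, not from any of the books), using a lognormal population purely as an example of an ordinary-looking skewed distribution:

```python
# Not from any of the books: a quick Python check that n = 30 is not a magic
# number. The population here is lognormal(0, 1), chosen just because it is
# an ordinary-looking skewed distribution.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
reps = 20_000  # simulated samples per sample size

for n in (30, 100, 1000):
    samples = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, n))
    means = samples.mean(axis=1)  # sampling distribution of the mean
    print(f"n = {n:>4}: skewness of sample means = {skew(means):.2f}")
```

For this population the sampling distribution of the mean is still clearly skewed at \(n = 30\), and still measurably skewed at \(n = 1000\).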
I’ve deliberately avoided mentioning the “significance issue”, since I’m more interested in how textbooks approach the presentation of material to a non-technical audience than I am with misinterpretations of significance testing specifically, but it’s worthwhile to see how each book covers the material. I find that most textbooks present significance testing in a way which is technically correct, but encourages the false intuition that the p-value is somehow a measure of the probability that the observed results are “true”. Gravetter is good about this. It develops the logic of significance testing properly:
A significant result permits the following conclusion: “This specific sample mean is very unlikely (\(p < .05\)) if the null hypothesis is true.” Statistics for the Behavioral Sciences (9th ed. Ch. 8, p. 260)
and then devotes an entire section to dispelling the misunderstanding that the p-value quantifies the probability of the null hypothesis. I’d have to say that, despite devoting far too much time to significance testing, which I don’t think is very useful in practice, Gravetter actually covers the logic of significance testing very well. Unfortunately, it falls back into common habits when discussing confidence intervals:
…we can confidently estimate that the value of the parameter should be located in the interval. Statistics for the Behavioral Sciences (9th ed. Ch. 9, p. 300)
On the other hand, Mayer is just a disaster:
By stating that there is less than 5% probability that an outcome occurred by chance, we are actually saying that there is a less than 5% probability that the null hypothesis is “true”. Introduction to statistics and SPSS in psychology (2013; Ch. 4, p. 68)
Absolutely not.
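To see why not, here’s a hypothetical simulation (mine, in Python; the 80% null rate, the effect size, and the sample size are made-up numbers chosen only for illustration). Every test below is carried out perfectly correctly, and yet far more than 5% of the \(p < .05\) results come from a true null:

```python
# Hypothetical illustration: assume 80% of studied effects are truly null and
# the rest have a modest true effect. Among results with p < .05, far more
# than 5% still come from the null.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)
n_experiments, n_per_experiment = 20_000, 20
null_fraction, true_effect = 0.8, 0.5  # assumed, purely illustrative

is_null = rng.random(n_experiments) < null_fraction
effects = np.where(is_null, 0.0, true_effect)
data = rng.normal(loc=effects[:, None], size=(n_experiments, n_per_experiment))

pvals = ttest_1samp(data, popmean=0.0, axis=1).pvalue
significant = pvals < 0.05
frac_null_given_sig = is_null[significant].mean()
print(f"Among p < .05 results, {frac_null_given_sig:.0%} came from a true null.")
```

The p-value is computed assuming the null is true; by itself it says nothing about how probable the null is.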
A key factor to remember with hypothesis testing is that we are dealing with probability, not certainty. Statistics will only tell us the likelihood that the outcome occurred by chance. Introduction to statistics and SPSS in psychology (2013; Ch. 4, p. 70)
No no no.
In this first example we will demonstrate how we can calculate the achieved statistical power, based on outcomes from a completed study. Introduction to statistics and SPSS in psychology (2013; Ch. 4, p. 84)
God no. Honestly, I’m a little disturbed to see Mayer released by such a major publisher (Pearson). I find it very difficult to follow, even with a strong background in the material. The book uses terminology in extremely confusing and non-standard ways. For example, it seems to conflate the standard error of the sample mean with the standard deviation of the population, or maybe it uses “population” to mean the distribution of sample means under repeated experiments? It’s hard to tell:
…the standard error of the mean equals the standard deviation of the population. Introduction to statistics and SPSS in psychology (2013; Ch. 4, p. 77)
As we saw earlier, the population is potentially infinite, so we need to estimate the standard deviation of the population from the standard error. Introduction to statistics and SPSS in psychology (2013; Ch. 4, p. 78)
Standard Error: the average variation of scores in a sampling distribution, or the estimated variation in the population. Introduction to statistics and SPSS in psychology (2013; Ch. 4, p. 80)
This isn’t a trivial mistake. The standard deviation and standard error are not only not equal, they don’t even quantify the variability of the same distribution. By confusing the concepts of sample, sampling distribution, and population, the book makes it impossible to understand any one of those concepts individually, which in turn makes it impossible to understand any technique that references them.
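The distinction is easy to see in a simulation (mine, in Python): with a population standard deviation of 10 and samples of size 25, the sample standard deviation hovers around 10, while the spread of the sample means (the standard error) hovers around \(10/\sqrt{25} = 2\).

```python
# A small check (not from either book) of the distinction: the sample standard
# deviation estimates the spread of individual observations in the population,
# while the standard error sigma/sqrt(n) describes the spread of the sample
# mean across repeated samples. They are very different numbers.
import numpy as np

rng = np.random.default_rng(4)
sigma, n, reps = 10.0, 25, 10_000

samples = rng.normal(loc=50.0, scale=sigma, size=(reps, n))
means = samples.mean(axis=1)

print(f"typical sample SD (estimates sigma = {sigma:.0f}): {samples.std(axis=1, ddof=1).mean():.2f}")
print(f"SD of the sample means (the standard error):  {means.std(ddof=1):.2f}")
print(f"theoretical standard error sigma/sqrt(n):     {sigma / np.sqrt(n):.2f}")
```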
Finally, Navarro takes the approach of working out a binomial example in painstaking detail before introducing the general concept of significance testing. I like this, since it makes it absolutely clear what is being calculated, and leaves no room for misinterpretation (though later it does very explicitly address the common misinterpretations of significance testing). The book then goes on to describe in detail the conflicting Neyman and Fisher approaches to statistical testing, though I’m not sure if this would help the student or just confuse them.
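To give a sense of why that works so well (this isn’t Navarro’s actual example, just the same kind of calculation with made-up numbers): suppose we see 62 heads in 100 flips and ask how surprising that would be if the coin were fair.

```python
# Not Navarro's actual example: a sketch of the same kind of binomial
# calculation. Under a null hypothesis of a fair coin (p = 0.5), how
# surprising is observing 62 heads in 100 flips?
from scipy.stats import binom

n, observed, p_null = 100, 62, 0.5  # hypothetical numbers

# two-sided p-value: probability, under the null, of a count at least as far
# from the expected 50 heads as the one observed
p_value = binom.cdf(n - observed, n, p_null) + binom.sf(observed - 1, n, p_null)
print(f"P(result at least this extreme | fair coin) = {p_value:.3f}")
```

Written out this way, there’s no ambiguity about what the number means: it’s the probability of data at least this extreme given the null, not the probability of the null.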
Though there are only three of them, I’d argue that these books give a pretty good idea of the general classes of books commonly aimed at students in the sciences: books which are broadly or technically correct, but don’t really build good intuition and don’t accurately portray how statistics is done in practice; books which are just plain wrong; and books or lecture notes, almost always independent or open source efforts, which present the material in a way which is technically correct and try desperately to convey the right intuition, but still contend with having to inoculate students against bad practices in other books and in the literature. This means that even good books like Navarro have to waste entire chapters on significance testing just because students have to understand it thoroughly in order to read the literature. To its credit, Navarro relegates the material to a single chapter (Gravetter devotes three chapters to the t-test! Why?!), and devotes an entire chapter early on to estimation (the only introductory textbook in the sciences to do so, as far as I know). The remainder of the book is dedicated to regression and ANOVA as models, and not simply as elaborate versions of the t-test, which is great. It’s a very minor point, but Navarro is also the only book to make a distinction between regression as a model and least-squares estimation. I think this distinction between a statistical model and the technique used to estimate it is very important when building intuition for modeling more generally.
I wrote this post out of curiosity, as I wanted to see how statistics was being taught here at Queens, and I ended up picking the first three books I found in stacks of old course outlines. I guess you could interpret this as an endorsement of Navarro’s Learning statistics with R, but really, I think Navarro is just the closest you can get to good given the constraints on statistics courses in fields where statistics just isn’t used very well. I once took a topology course taught using the Moore method, and the more I think about it, the more I think it’s the only right way to approach data analysis more generally. Give students datasets and have them devise methods to answer questions about them; have other students try to break those methods, or identify boundary cases or limitations; then have them devise ways to test or verify their models. No mass-univariate significance testing silliness would ever survive this kind of Darwinism.