Must we adjust p-values if we test multiple hypotheses?
Yes, yes you must
Jerzy Neyman and Egon Pearson developed their idea of hypothesis testing in 1933, and it has since become the dominant paradigm of statistical inference. The goal of Neyman and Pearson’s method was statistical tests that control errors, so that scientists could act on claims without being wrong too often. A single number, alpha, controls the long-run rate of wrong decisions, and it is set before a scientist conducts any experiment. One corollary of this is that the scientific literature contains errors, so be careful about trusting scientific claims.
If you want to test more than one hypothesis, you can’t use the overall alpha for every individual test. Using the same alpha for all tests inflates the false positive error rate. A false positive is when a statistical test claims that an effect is present, but no effect is present. Adding an endpoint to an experiment does not just add one more possible outcome; it multiplies the number of possible outcomes. With one dependent variable, the experiment has two possible outcomes: the effect is either true (T) or false (F). With a second dependent variable, the outcome table has four entries: FF, FT, TF, TT. As you add more hypotheses, it becomes more likely that at least one of them will be a false positive. To prevent this, you need to control the error rate at the level of the claim, which is why statisticians recommend that researchers use some form of multiplicity correction in these situations.
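To see how quickly the problem grows, here is a minimal sketch of the probability of at least one false positive as uncorrected tests accumulate, assuming the tests are independent and all null hypotheses are true:

```python
# Minimal sketch: chance of at least one false positive across m independent
# tests at an uncorrected 5% alpha, when every null hypothesis is true.
alpha = 0.05
for m in range(1, 11):
    fwer = 1 - (1 - alpha) ** m   # P(at least one false positive)
    print(f"{m:2d} tests: P(at least one false positive) = {fwer:.3f}")
```

With five uncorrected tests the chance of at least one false positive is already above 20%.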
Adding an endpoint to your analysis is legitimate; for example, testing whether a drug is effective for both depression and anxiety. Leaving your significance threshold uncorrected, however, is not. Some scientific disciplines question whether they need multiplicity corrections, but any paper arguing that multiplicity corrections are unnecessary is wrong: the need for adjustments is a straightforward consequence of the Neyman-Pearson approach. There are two cases which require multiplicity corrections:
- When your experiment has multiple endpoints.
- When your experiment has multiple levels of an intervention.
The simplest approach to multiple comparisons is to control the familywise error rate (FWER): the probability of making at least one error across a family of tests. Controlling it ensures that the total rate of false positive results stays below a strict threshold. Because keeping the FWER at an acceptable level is demanding, it is a stringent criterion. The best-known method to control the FWER is the Bonferroni correction, which divides the significance level by the number of tests, thereby controlling the alpha of the overall family. Researchers use Bonferroni corrections because they are simple and the resulting significance threshold can be known ahead of time. The drawback of the Bonferroni correction is that it applies the harshest correction: it is so conservative that it can miss true effects.
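As an illustration, here is a minimal sketch of the Bonferroni correction applied to three hypothetical p-values (the numbers are made up; in practice you could also use `multipletests(..., method='bonferroni')` from statsmodels):

```python
# Bonferroni: divide the significance level by the number of tests.
# The p-values below are hypothetical, for illustration only.
p_values = [0.004, 0.020, 0.049]
alpha = 0.05
threshold = alpha / len(p_values)   # 0.05 / 3 ~ 0.0167, known before any data are collected
for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f}: {verdict}")
```

Only the first p-value survives the correction; the other two would have counted as significant at an uncorrected 5% threshold.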
What counts as a family of tests requires some thought. Error control does not only aim to constrain the number of erroneous statistical inferences; it also aims to constrain the number of erroneous theoretical inferences. So we need to understand which tests relate to a single theoretical inference. Researchers often fail to correct for multiple comparisons because they don’t understand their own theoretical inference.
We can control error rates for all tests in an experiment (the experimentwise error rate), or for a specific group of tests (the familywise error rate). Broad questions have many possible answers. As an example, consider running a 2x2x2 ANOVA. We test three main effects, three two-way interactions, and one three-way interaction: seven tests in total. Suppose we want to know whether there is ‘an effect’ in any of these tests, so rejecting the null hypothesis in any one of them would suggest the answer to our question is ‘yes’. Here the experimentwise error rate is what we need to correct for: the probability of deciding there is any effect when all null hypotheses are true. With a 5% alpha level for every test in the ANOVA, we would conclude there is an effect about 30% of the time, even when all null hypotheses are true.
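The 30% figure follows directly from the uncorrected alpha; a quick sketch, assuming the seven tests are independent:

```python
# Experimentwise error rate for seven tests at an uncorrected 5% alpha,
# assuming independence and all null hypotheses true.
alpha = 0.05
n_tests = 7   # 3 main effects + 3 two-way interactions + 1 three-way interaction
uncorrected = 1 - (1 - alpha) ** n_tests
corrected = 1 - (1 - alpha / n_tests) ** n_tests   # Bonferroni: alpha / 7 per test
print(f"uncorrected: {uncorrected:.2f}, corrected: {corrected:.3f}")
```

The uncorrected rate is about 0.30; with a Bonferroni-corrected threshold of α/7 per test it drops back to roughly 0.05.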
Researchers are often vague about their claims and do not specify their hypotheses clearly enough, so the issue of multiple comparisons remains. Suppose we want to compare predictions from two competing theories, and we design an experiment to test them. Theory A predicts an interaction in a 2x2 ANOVA. Theory B predicts no interaction, but at least one significant main effect.
The researcher will perform three statistical tests. But don’t assume that we need to control the error rate across all three by using α/3! A ‘family’ is a set of related tests that bear on a single inference. In this case, where we test two theories, there are two families of tests: the first consists of the single interaction effect, the second of the two main effects. For both families the overall alpha level is 5%. We will decide to accept Theory A when p < α for the interaction, and not to accept Theory A when p > α. If the null is true, at most 5% of these decisions will be incorrect in the long run, so we control the percentage of decision errors. For the second theory, we need to apply a Bonferroni correction: we will decide to accept Theory B when p < α/2 for either of the two main effects, and not to accept Theory B otherwise. When the null hypotheses are true, we will wrongly accept Theory B at most 5% of the time. We could still accept neither theory, or even both, if the experiment was not a decisive test.
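A minimal sketch of these two decision rules, with hypothetical p-values standing in for the interaction and the two main effects:

```python
# Two families of tests for two theories. The p-values are hypothetical.
alpha = 0.05
p_interaction = 0.03             # family 1: one test, evaluated at alpha
p_main_1, p_main_2 = 0.04, 0.20  # family 2: two tests, each evaluated at alpha / 2

accept_theory_a = p_interaction < alpha
accept_theory_b = (p_main_1 < alpha / 2) or (p_main_2 < alpha / 2)

print(f"Theory A supported: {accept_theory_a}")  # True  (0.03 < 0.05)
print(f"Theory B supported: {accept_theory_b}")  # False (neither p < 0.025)
```

Note that the main effect with p = 0.04 would have looked significant at an uncorrected 5% level, but within its family of two tests it does not support Theory B.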
Here’s another example. Assume we collect data from 100 participants in a control and a treatment condition, and measure three dependent variables (dv1, dv2, and dv3); the true effect size is 0. We will analyse the three dv’s in independent t-tests. We must specify our alpha level and decide whether we need to correct for multiplicity, and how we control error rates depends on the claim we want to make. One claim is that the treatment works if there is an effect on any of the three variables; we corroborate this claim when the p-value of any of the t-tests is smaller than the alpha level. Alternatively, we could make three different predictions: three separate hypotheses (H1, H2, and H3), predicting an effect on dv1, dv2, and dv3 respectively. The criterion for each t-test is the same, but we now have three claims to check, each of which can be corroborated or not.
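A minimal sketch of this setup, assuming 100 participants per condition and normally distributed dv’s (scipy is used for the t-tests; the seed and distributions are illustrative):

```python
# Simulate the example: control vs treatment, three dependent variables,
# true effect size 0, analysed with three independent t-tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05

control = rng.normal(0, 1, size=(100, 3))    # columns: dv1, dv2, dv3
treatment = rng.normal(0, 1, size=(100, 3))  # true effect is zero

p_values = [stats.ttest_ind(treatment[:, i], control[:, i]).pvalue for i in range(3)]

# Claim 1: 'the treatment works' if any dv shows an effect.
# One claim, three tests -> correct the threshold (here: Bonferroni, alpha / 3).
treatment_works = any(p < alpha / 3 for p in p_values)

# Claim 2: three separate predictions H1, H2, H3, one per dv.
# Three claims, one test each -> each evaluated at alpha.
corroborated = [p < alpha for p in p_values]

print(p_values)
print("Claim 'the treatment works':", treatment_works)
print("H1, H2, H3 corroborated:", corroborated)
```

The same three p-values are judged against different thresholds depending on whether they support one claim or three.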
You don’t always need to control the FWER. Limiting the proportion of false positives among all discoveries is often good enough, and there is a name for this rate: the false discovery rate. In maths you can often define yourself out of a problem, and the false discovery rate is one such move. The false discovery rate (FDR) is the expected proportion of false positives among all positive tests. Any adjustment scheme for the FDR is more forgiving than a scheme for the FWER, at the cost of more false positives overall.
The Benjamini-Hochberg procedure is a common method for controlling the FDR. It ranks the p-values and compares each one to a threshold that increases with its rank. Unlike the Bonferroni correction, it is data-dependent, so the outcome cannot be known in advance.
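A minimal sketch of the procedure on hypothetical p-values (statsmodels offers the same thing via `multipletests(p, method='fdr_bh')`):

```python
# Benjamini-Hochberg: sort the p-values and compare the i-th smallest to
# q * i / m; reject every hypothesis up to the largest rank that passes.
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array marking which hypotheses are rejected at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                     # ranks, smallest p first
    thresholds = q * np.arange(1, m + 1) / m  # increasing thresholds
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()        # largest rank that passes
        reject[order[: k + 1]] = True
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20]))
# [ True  True False False False] with these made-up p-values
```

Because the thresholds depend on the observed p-values, you cannot say before seeing the data which hypotheses will survive, unlike with a Bonferroni threshold.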