Power failure threatens neuroscience
A new research paper with the cheeky title “Power failure: why small sample size undermines the reliability of neuroscience” has come out in a neuroscience journal. The basic idea isn’t novel, but it’s one of those statistical points that makes your life more difficult (if more productive) once you understand it. Small research studies, as everyone knows, are less likely to detect differences between groups. What is less widely appreciated is that even when a small study does see a difference between groups, that difference is more likely not to be real.
The ‘power’ of a statistical test is the probability that you will detect a difference if there really is a difference of the size you are looking for. If the power is 90%, say, then you are pretty sure to see a difference if there is one, and based on standard statistical techniques, pretty sure not to see a difference if there isn’t one. Either way, the results are informative.
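One way to see what ‘power’ means in practice is just to simulate it. Here’s a minimal sketch, not from the paper: the 0.5 standard deviation difference and 20 subjects per group are made-up numbers, and the power is estimated as the fraction of simulated experiments in which a two-sample t-test comes out significant.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
effect, n, alpha, n_sims = 0.5, 20, 0.05, 10_000   # illustrative numbers only

hits = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(effect, 1.0, n)    # a real difference of 0.5 SD
    if ttest_ind(control, treated).pvalue < alpha:
        hits += 1

print(f"Estimated power: {hits / n_sims:.2f}")   # comes out around 0.33, nowhere near 90%
```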
Often you can’t afford to do a study with 90% power given the current funding system. If you do a study with low power, and the difference you are looking for really is there, you still have to be pretty lucky to see it: the data have to, by chance, come out more favorable to your hypothesis than they typically would. But if you’re relying on the data being unusually favorable to your hypothesis, you can also see a difference when there isn’t one there at all.
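To see what that selection does to the published estimates, here is the same made-up simulation setup again (a sketch of the general phenomenon, not a calculation from the paper): keep only the runs that reach p < 0.05 and average the differences they report.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
effect, n, alpha = 0.5, 20, 0.05    # same illustrative numbers as above

significant_estimates = []
for _ in range(10_000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(effect, 1.0, n)
    if ttest_ind(control, treated).pvalue < alpha:
        # record the estimated difference from this "lucky" experiment
        significant_estimates.append(treated.mean() - control.mean())

print(f"True difference: {effect}")
print(f"Average difference among significant results: {np.mean(significant_estimates):.2f}")
# typically around 0.7, noticeably bigger than the true 0.5
```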
Combine this with publication bias: if you find what you are looking for, you get enthusiastic and send it off to high-impact research journals. If you don’t see anything, you won’t be as enthusiastic, and the results might well not be published. After all, who is going to want to look at a study that couldn’t have found anything, and didn’t? The result is that we get lots of exciting neuroscience news, often with very pretty pictures, that isn’t true.
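You can put rough numbers on the combination. The sketch below uses my own illustrative assumption that 10% of the hypotheses being tested are true, along with the usual 5% significance threshold; the point is how fast the proportion of real effects among the significant (and therefore publishable) results falls as power drops.

```python
def share_of_significant_results_that_are_real(power, alpha, prior_true):
    """Among results with p < alpha, what fraction reflect a real effect?"""
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    return true_positives / (true_positives + false_positives)

for power in (0.9, 0.5, 0.2):
    ppv = share_of_significant_results_that_are_real(power, alpha=0.05, prior_true=0.1)
    print(f"power = {power:.1f}: {ppv:.0%} of significant results are real")
# power = 0.9: 67%; power = 0.5: 53%; power = 0.2: 31%
```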
The same is true for nutrition: I have a student doing an Honours project looking at the replicability (in a large survey database) of the sort of nutrition and health stories that make it to the local papers. So far, as you’d expect, the associations are a lot weaker when you look in a separate data set.
Clinical trials went through this problem a while ago, and while they often have lower power than one would ideally like, there’s at least no way you’re going to run a clinical trial in the modern world without explicitly working out the power.
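For a sense of what ‘explicitly working out the power’ looks like, here is the sort of calculation a trial protocol has to include, with made-up inputs (a 0.5 standard deviation target difference, 90% power, 5% significance level):

```python
from statsmodels.stats.power import TTestIndPower

# participants per arm needed to have 90% power to detect a 0.5 SD difference
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.9)
print(f"About {n_per_group:.0f} participants per group")   # roughly 85
```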
Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

Other people’s reactions
I’m a bit confused by something. The claim that a given p value represents less evidence against the null in a low-powered study than the same p from a high-powered study seems to conflict with a simulation by EJ Wagenmakers (see especially page 792): http://ejwagenmakers.com/2007/pValueProblems.pdf
Wagenmakers showed that the same p value produced by two studies with different Ns represents stronger evidence against the null in the *smaller* study. This is because a significant finding in a small study implies a larger sample effect size, which means stronger evidence against the null.
Could the different conclusions be because the Button et al paper doesn’t take into account the fact that the p value provides indirect information about the size of the sample effect? Or are the two papers talking about two different issues?
Good question. I think the papers are looking at different issues, but it’s subtle.
Button et al are holding the effect size constant, and looking at repeated experiments with different sample sizes and different power.
Wagenmakers is holding the p-value constant and varying the sample size, which varies the effect size to compensate.
Button et al are saying that positive results from underpowered studies are less likely to be true, compared to adequately powered studies of the same phenomena.
Wagenmakers is saying that positive results from small studies are more likely to be true, compared to results with the same p-value from larger studies (with different effect sizes, and so of different phenomena).
They are both correct, but on slightly different questions.
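A little arithmetic sketch of the contrast (my own made-up numbers, not from either paper): hold the p-value fixed and let the sample size grow, and the sample effect size implied by that p has to shrink to compensate, which is why Wagenmakers’ comparison is across different phenomena.

```python
from scipy.stats import t

p_value = 0.04                              # a fixed, just-significant p-value
for n in (10, 100, 1000):                   # per-group sample sizes (illustrative)
    df = 2 * n - 2
    t_stat = t.ppf(1 - p_value / 2, df)     # t statistic that gives this two-sided p
    implied_d = t_stat * (2 / n) ** 0.5     # sample effect size (Cohen's d) implied by that t
    print(f"n = {n:4d} per group: p = {p_value} implies an observed d of {implied_d:.2f}")
# n = 10 gives d close to 1; n = 1000 gives d under 0.1: same p, very different effect sizes
```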
You might also be interested to know that Ken Rice has shown how to construct some common p-values as optimal Bayesian decisions, balancing false negatives against bias rather than against false positives (Rice K (2010), “A Decision-Theoretic Formulation of Fisher’s Approach to Testing”, The American Statistician, 64(4), 345–349).