March 9, 2016
The arbitrary magic of p < 0.05
A limerick by Roderick Little highlights the importance of recent events in the statistical community:
In statistics, one rule did we cherish:
P point oh five we publish, else perish!
Said Val Johnson, “that’s out of date,
our studies don’t replicate
P point oh oh five, then null is rubbish!”
This hits close to home for us in the Sustainable Transportation Lab, as statistical analysis is at the heart of much of our research. On Monday, the American Statistical Association (ASA) released a consensus statement to address the misuse of p-values and promote a better understanding of them among researchers. The statement is important and timely: researchers have been increasingly criticized for applying significance testing blindly and mechanistically to give a patina of scientific legitimacy to fundamentally weak research practices.
One of the goals of the statement is to provide a definition of the p-value that is straightforward and intuitive, which is not an easy task. Interviews with a group of scientists show that even the world's leading research experts, who can give the technical definition of a p-value, find it hard to explain p-values and statistical significance clearly in plain language.
The definition provided by the ASA’s statement is: “Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.” More helpful than this definition, however, is the first of their six principles: “P-values can indicate how incompatible the data are with a specified statistical model.”
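To make the definition concrete, here is a minimal sketch of one way to estimate such a probability: a permutation test for the difference in sample means. This is our own illustration, not something taken from the ASA statement. The "specified statistical model" is assumed to be that the group labels are interchangeable, and the function name, the number of permutations, and the commute-time data are all made up for the example.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in sample means.

    Assumed statistical model: the group labels are exchangeable, i.e.
    both groups were drawn from the same distribution.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # A small tolerance guards against floating-point near-ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1

    # Counting the observed labelling itself keeps the estimate above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical commute times in minutes for two made-up groups.
    bus = [31, 35, 29, 40, 38, 33, 36, 34]
    bike = [27, 30, 25, 33, 29, 28, 31, 26]
    print(permutation_p_value(bus, bike))
```

The returned number is the share of random relabellings that produce a mean difference equal to or more extreme than the observed one, which is exactly the "incompatibility with a specified model" that the first principle describes. It says nothing about whether that model is the right one.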
Significant or not? The struggle was real.
Statistics for Dummies summarizes how p-values are commonly applied and interpreted:
- P-value smaller than or equal to 0.05 indicates strong evidence against the null hypothesis, so it should be rejected;
- P-value larger than 0.05 indicates weak evidence against the null hypothesis, so it cannot be rejected;
- P-value very close to 0.05 means it can go either way.
Where did the p<0.05 criterion come from?
The reason there is such a variety of interpretations of the p-value is that it is hard to interpret, not only for researchers today but also for Ronald Fisher, the very person who popularized the p-value as a research tool in statistics. Fisher proposed the word “significant” to describe small p-values, which, as Steven Goodman points out, meant “something worthy of notice”. In his book Statistical Methods for Research Workers, Fisher’s exact words on the use of the p-value and the significance level of 5% are:
“Personally, the writer prefers to set a low standard of significance at 5 percentage point… A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”
So for Fisher, the choice of 5% as the threshold is nothing more than an arbitrary “personal choice”.
Why are we so stuck on p<0.05?
The ASA statement includes an exchange that illustrates this, posted by George Cobb, Professor Emeritus of Mathematics and Statistics at Mount Holyoke College, on the ASA discussion forum:
Q: Why do so many people still use p ≤ 0.05?
A: Because that’s what they were taught in college or grad school.
Q: Why do so many colleges and grad schools teach p ≤ 0.05?
A: Because that is still what the scientific community and journal editors use.
The bottom line is that in some scientific areas, p < 0.05 has become the threshold that decides whether a study gets published. Ubiquitous use combined with confused interpretation has led to a wide range of colorful descriptions of p-values. The result seems clear to most when p is a lot smaller or a lot bigger than 0.05, but when it is close to that magical number, people get really obscure yet creative: p=0.073 is a “barely detectable statistically significant difference”; p=0.054 “approached acceptance levels of statistical significance”; p=0.07 indicates “marginal significance”; p=0.1 is interpreted as “loosely significant”. You can find a whole list here. Through a keyword search in Google Scholar, Matthew Hankins collected the p-values that were described as “marginally significant”. The phrase is rarely used when the p-value is smaller than 0.05. Its frequency peaks at p=0.06, then decreases until another peak around p=0.1. Hankins argued that 0.05 should be the hard threshold of significance: “if your p-value remains stubbornly higher than 0.05, you should call it ‘non-significant’ and write it up as such”. Hmmm, but why?
Statistically significant? So what! – Several misconceptions about p-values
The ASA statement addresses several myths about p-values:
- A p-value is the probability of data at least as extreme as those observed, given the null hypothesis, not the probability of the hypothesis given the data. So it does not measure the probability that the studied hypothesis is true.
- Even when a result achieves statistical significance at the popularly accepted level (i.e. p < 0.05), it means little in practical terms without considering effect size, because the p-value carries no information about the magnitude of an effect. That magnitude is captured by effect estimates and confidence intervals. Many scientific papers should pay more attention to this fact.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. The evidence from a given study needs to be combined with that from prior work to reach a conclusion, because the null hypothesis could still be true even after a significant result. This is the central consideration of the Bayesian approach, and it also accords with Fisher’s comment: “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance”. In other words, if subsequent studies also yield significant p-values, it can be concluded that the observed effects are unlikely to be the result of chance alone.
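Two of these points are easy to check with a few lines of arithmetic. The sketch below is our own illustration and rests on stated assumptions: a two-sample z-test with a known common standard deviation, and a textbook application of Bayes' rule. The function names and all the numbers (effect sizes, sample sizes, the 90% prior, the 80% power) are hypothetical and chosen only for the example.

```python
import math


def two_sided_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means (z-test, known common sd)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | p < alpha), computed with Bayes' rule.

    prior_null: assumed share of tested hypotheses where the null is true.
    power: assumed probability of p < alpha when the null is false.
    """
    false_positives = alpha * prior_null
    true_positives = power * (1.0 - prior_null)
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # A trivial effect (0.02 sd) measured on a million people per group.
    print(two_sided_z_p_value(mean_diff=0.02, sd=1.0, n_per_group=1_000_000))
    # A large effect (0.5 sd) measured on ten people per group.
    print(two_sided_z_p_value(mean_diff=0.5, sd=1.0, n_per_group=10))
    # Chance that the null is true after a "significant" result, if 90% of
    # tested nulls are true and studies have 80% power (made-up values).
    print(prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8))
```

Under these assumptions the trivial effect comes out far below 0.05 while the large effect does not, so the p-value alone cannot tell you which finding matters. The last line works out to 0.045 / (0.045 + 0.08) = 0.36: even with p < 0.05, the null would still be true in roughly a third of such cases, which is why evidence from prior work has to enter the conclusion.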
The takeaway
- P-values should not be the “badges of the truth”
As a practical statistical tool, the p-value is important. However, it should not be embraced as the single criterion for judging research papers. Multiple measures such as confidence intervals, prediction intervals, and Bayes factors should be explored when appropriate. As pointed out by Andrew Gelman: “The solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.”
- P-value = 0.05 should cease being the magical line
The arbitrary threshold of p<0.05 encourages selective reporting of research results, whereas proper inference requires full reporting and transparency. As pointed out by the ASA statement, “Cherry-picking promising findings, also known by such terms as data dredging, significance chasing, significance questing, selective inference and p-hacking” should be avoided.
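A small simulation shows why cherry-picking is so corrosive. The sketch below is our own illustration under simple assumptions: every outcome in every study is pure noise (the null is true everywhere), outcomes are independent, and each is checked with a z-test with known standard deviation. The function names, study counts, and sample sizes are hypothetical.

```python
import math
import random


def z_test_p_value(sample_a, sample_b, sd=1.0):
    """Two-sided z-test p-value for a difference in means (known sd)."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    standard_error = sd * math.sqrt(1.0 / n_a + 1.0 / n_b)
    return math.erfc(abs(diff / standard_error) / math.sqrt(2.0))


def share_of_studies_with_a_hit(n_outcomes, n_studies=1000, n_per_group=20,
                                alpha=0.05, seed=0):
    """Share of all-null studies whose best outcome reaches p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _study in range(n_studies):
        p_values = []
        for _outcome in range(n_outcomes):
            # Both groups come from the same distribution: no real effect.
            group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
            p_values.append(z_test_p_value(group_a, group_b))
        # Cherry-picking: only the smallest p-value gets reported.
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


def familywise_error_rate(n_outcomes, alpha=0.05):
    """Chance of at least one p < alpha among independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


if __name__ == "__main__":
    print(share_of_studies_with_a_hit(n_outcomes=1))
    print(share_of_studies_with_a_hit(n_outcomes=10))
    print(familywise_error_rate(n_outcomes=10))
```

With one outcome per study, about 5% of these no-effect studies should cross the line, as intended. With ten outcomes and only the best one reported, the expected share rises to 1 - 0.95^10, or about 40%, even though there is nothing to find. Reporting every test that was run is what lets a reader tell the difference.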