We demonstrated a biased distribution of p values in abstracts listed in Medline in 2012, with an apparent excess of p values immediately below 0.05 relative to the frequency immediately above 0.05. This finding is likely to be evidence of p-hacking (biased analysis and reporting) or publication bias. We expected a more uniform distribution of p values for randomized trials, on the assumption that such trials are conducted where equipoise exists, that is, where no difference between the two groups is expected in advance. However, the distribution of p values in this group was similar to that of the overall group.
Although the graphical presentation of the p value distribution in this study shows a clear picture, a weakness of the study is the lack of a formal statistical analysis: there is no theoretical or calculated true distribution to test against. The decision to use all p values may have introduced some bias through the inclusion of similar (related) p values from single studies. We also plotted the distribution of the first reported p values, but found it to be similar to the distribution of all reported p values, and as the latter contained more data, we chose to report all p values. Only p values reported in abstracts were used, because many readers read only the abstract, the abstract is the only part of the article available in Medline, and the most important p values are likely to be reported in the abstract [13, 14]. We excluded p value ranges, but this probably led to an underestimation of the bias around 0.05, as the reported range was commonly given as “p < 0.05”, and including these values would likely have added to the number of p values immediately below 0.05. During data collection we observed some errors in the Medline database: some reported p values were greater than one, and some papers labeled as human studies were in fact non-human. These errors were infrequent and were unlikely to have influenced the results. It is possible that, because papers often contain many p values, significant p values in the manuscript are more likely to be reported in the abstract than values above 0.05. However, we consider this to be another form of selective reporting bias, as only the most important outcomes (such as any primary, patient-important outcomes) should be reported in the abstract, regardless of significance.
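To make the extraction rules concrete, the sketch below shows one way exact p values could be pulled from abstract text while excluding range expressions such as “p < 0.05” and flagging implausible values (e.g., p > 1). This is an illustrative assumption about the approach, not the study's actual extraction pipeline; the regular expressions and function name are hypothetical.

```python
import re

# Illustrative sketch only: the study does not describe its extraction rules at
# this level of detail, so the patterns below are assumptions.
P_EXACT = re.compile(r"\bp\s*=\s*(0?\.\d+|1(?:\.0+)?)", re.IGNORECASE)   # exact values, e.g. "p = 0.048"
P_RANGE = re.compile(r"\bp\s*[<>≤≥]\s*0?\.\d+", re.IGNORECASE)           # ranges, e.g. "p < 0.05" (excluded)

def extract_p_values(abstract: str) -> list[float]:
    """Return exact p values from an abstract, ignoring range expressions."""
    values = [float(m.group(1)) for m in P_EXACT.finditer(abstract)]
    # Drop database errors such as p values greater than one.
    return [v for v in values if 0.0 <= v <= 1.0]

example = ("Treatment improved survival (p = 0.048); secondary outcomes "
           "were not significant (p < 0.05 threshold, p = 0.12).")
print(extract_p_values(example))        # [0.048, 0.12]
print(bool(P_RANGE.search(example)))    # True: a range expression is present but excluded
```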
Gøtzsche commented on the distribution of p values between 0.04 and 0.06, noting a higher than expected number of values below 0.05 [2]. His study used the first-reported p values in abstracts and noted a higher proportion of significant p values in non-randomized studies than in randomized trials. Gøtzsche observed that the high proportion of significant findings in randomized trials is unexpected, given the need for equipoise (presumed equivalence of treatment options) in clinical trials, which, if present, would lead to an unskewed (flat) distribution of p values. He also noted that many of the significant p values were incorrectly reported or analysed. Gøtzsche's findings are consistent with the presence of bias in analysis and reporting, and with the findings of our study. Masicampo et al., in a study of the p value distribution in the abstracts of three major psychology journals, showed that there were more p values immediately below 0.05 than expected from the p value distribution in other ranges [1]. The distribution they found matches the distribution in our study. Jager and Leek examined the p values reported in five major medical journals over a decade, but reported only the distribution of p values below 0.05 [15]. Their distribution was similar to ours, although their study provided no information on the relative frequency immediately below and above 0.05. Simonsohn et al. suggested in 2013 the use of a “p-curve”, a graphical p value distribution, as a tool to evaluate whether the literature on a given topic has been influenced by publication bias or p-hacking; they argue that a left-skewed p-curve (an excess of significant p values close to 0.05) is evidence of biased analysis or selective reporting [4]. Ioannidis also concluded that significant p values were over-represented in a review of meta-analyses of neuroleptic agents for schizophrenia [16]. Apart from publication bias and bias in analysis and outcome reporting, Ioannidis added data fabrication as another possible cause of an over-representation of significant p values.
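As a hedged illustration of the kind of local comparison these studies rely on (not an analysis performed in our paper), one could count p values in narrow bins immediately below and immediately above 0.05 and ask how surprising the imbalance would be if the distribution were locally flat. The function name, bin width, and toy data below are assumptions for demonstration only.

```python
from math import comb

# Illustrative diagnostic, not the study's analysis: compare counts in narrow
# bins just below and just above the 0.05 threshold. Under a locally flat
# p value distribution the two bins should hold roughly equal counts.
def excess_below_threshold(p_values, threshold=0.05, width=0.005):
    below = sum(1 for p in p_values if threshold - width <= p < threshold)
    above = sum(1 for p in p_values if threshold < p <= threshold + width)
    n = below + above
    # One-sided binomial tail probability of seeing at least `below` values in
    # the lower bin if each value were equally likely to fall in either bin.
    tail = sum(comb(n, k) for k in range(below, n + 1)) / 2 ** n if n else 1.0
    return below, above, tail

# Toy data with a bump immediately below 0.05.
demo = [0.049, 0.048, 0.047, 0.046, 0.049, 0.051, 0.053, 0.02, 0.2, 0.8]
print(excess_below_threshold(demo))   # (5, 2, ~0.23): counts and tail probability
```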
Distinguishing between publication bias and methodological biases (bias in analysis, selective reporting, and data fabrication) is difficult. Funnel plot asymmetry, often interpreted as evidence of publication bias, can also be explained by these other forms of bias, as p values are artificially lowered and effect estimates exaggerated [17]. However, we consider bias immediately adjacent to 0.05 (as shown in our study) more likely to be due to methodological biases (working to push the p value below the level of significance) than to publication bias (which applies to all p values below 0.05, not specifically those immediately below it). An over-representation of lower p values driven by methodological biases rather than publication bias is also consistent with findings of effect estimate exaggeration in research [18–20] and with the difficulty of replicating significant findings in the medical literature [21–23]. Our study implies that the reporting of p values in human research is biased. Further research should explore predictors of bias in the distribution of p values, such as study type, methodology, study size, and journal type. In a later study, we aim to report on the distribution of p values described as ranges, such as “p < 0.05”, and on possible predictors of significance.
There is some evidence that the quality of abstract reporting has improved over time owing to initiatives such as CONSORT [24–26]. However, these reporting requirements apply only to some studies (such as randomized trials) and do not exclude the possibility of methodological biases and selective reporting. Unfortunately, there are many more reasons why science has been unable to self-correct [27].
Reports of statistical significance in medical research influence clinical decision making, and bias in such reporting should be considered when interpreting information from abstracts. Preventing bias in reported p values would require open and complete reporting of research protocols and methods (to avoid analysis and reporting bias), analyses adjusted for multiple testing (allowing for the increased probability of finding significance by chance), and publication of all research (to avoid publication bias) [28].
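The recommendation of analyses adjusted for multiple testing does not commit to a particular method; as a minimal sketch of one widely used procedure (Holm–Bonferroni), the example below shows how a set of p values could be adjusted so that the family-wise error rate stays at the nominal level. The function name and example values are illustrative assumptions.

```python
# Minimal sketch of the Holm–Bonferroni step-down adjustment; one of several
# possible corrections for multiple testing, not the method prescribed here.
def holm_bonferroni(p_values, alpha=0.05):
    """Return a parallel list of booleans: True where the hypothesis is rejected."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m = len(p_values)
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p values also fail
    return reject

print(holm_bonferroni([0.001, 0.04, 0.03, 0.2]))   # [True, False, False, False]
```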