What Are the Implications of Alternative Alpha Thresholds for Hypothesis Testing in Orthopaedics?
D. C. Landy, T. J. Utset-Ward, M. J. Lee, University of Chicago, Chicago, IL, USA
D. C. Landy, Department of Orthopaedic Surgery and Rehabilitation Medicine, University of Chicago Medicine & Biological Sciences, 5841 S. Maryland Ave. Rm. P-211, MC3079, Chicago, IL 60637 USA, Email: firstname.lastname@example.org
Received 2019 Jan 4; Accepted 2019 May 8.
Copyright © 2019 by the Association of Bone and Joint Surgeons
Clinical research in orthopaedics typically reports the presence of an association after rejecting a null hypothesis of no association using an alpha threshold of 0.05 at which to evaluate a calculated p value. This arbitrary value is one factor contributing to the current difficulties in reproducing research findings. A proposal to lower the alpha threshold to 0.005 is gaining attention. However, it is currently unknown how alpha thresholds are used in orthopaedics or how reported p values are distributed.
We sought to describe the use of alpha thresholds in two orthopaedic journals by asking (1) How frequently are alpha threshold values reported? (2) How frequently are power calculations reported? (3) How frequently are p values between 0.005 and 0.05 reported for the main hypothesis? (4) Are p values less than 0.005 associated with study characteristics such as design and reporting power calculations?
The 100 most recent original clinical research articles from two leading orthopaedic journals at the time of this proposal were reviewed. For studies without a specified primary hypothesis, a main hypothesis was selected that was most consistent with the title and abstract. The p value for the main hypothesis and lowest p value for each study were recorded. Study characteristics including details of alpha thresholds, beta, and p values were recorded. Associations between study characteristics and p values were described. Of the 200 articles (100 from each journal), 23 were randomized controlled trials, 141 were cohort studies or case series (defined as a study in which authors had access to original data collected for the study purpose), 31 were database studies, and five were classified as other.
An alpha threshold was reported in 166 articles (83%), with all but two reporting a value of 0.05. Forty-two articles (21%) reported performing a power calculation. The p value for the main hypothesis was less than 0.005 for 88 articles (44%), between 0.05 and 0.005 for 67 (34%), and greater than 0.05 for 29 (15%). The smallest p value was between 0.05 and 0.005 for 39 articles (20%), less than 0.005 for 143 (72%), and either not provided or greater than 0.05 for 18 (9%). Although 50% (65 of 130) of cohort and database papers had a main hypothesis p value less than 0.005, only 26% (6 of 23) of randomized controlled trials did. Only 36% (15 of 42) of articles reporting a power calculation had a p value less than 0.005 compared with 51% (73 of 142) of those that did not report one.
Although a lower alpha threshold may theoretically increase the reproducibility of research findings across orthopaedics, this would preferentially select findings from lower-quality studies or increase the burden on higher quality ones. A more-nuanced approach could be to consider alpha thresholds specific to study characteristics. For example, randomized controlled trials with a prespecified primary hypothesis may still be best evaluated at 0.05 while database studies with an abundance of statistical tests may be best evaluated at a threshold even below 0.005.
Surgeons and scientists in orthopaedics should understand that the default alpha threshold of 0.05 represents an arbitrary value that could be lowered to help reduce type-I errors; however, it must also be appreciated that such a change could increase type-II errors, increase resource utilization, and preferentially select findings from lower-quality studies.
Although the lack of an association can be important, medical research most often focuses on identifying an association, a preference which itself contributes to erroneous findings [ 13 , 18 ]. The existence of an association is usually assessed using frequentist statistics (such as p values) to reject a null hypothesis, which typically is defined as “no association.” Though limitations of this approach are well described, this continues to be the framework for conducting clinical orthopaedic research [ 7 , 19 ]. Even for noninferiority trials and other emerging study types with varying null hypotheses, the same general testing framework often is used [ 22 ]. Unfortunately, we and others suspect that a fair amount of published medical research may be false, as many findings have been difficult to reproduce in subsequent studies [ 8 , 10 ].
When a null hypothesis is falsely rejected and an association falsely identified, this is referred to as a type-I error. The probability of type-I error is defined as alpha. When a null hypothesis is falsely not rejected and an association falsely missed, this is referred to as type-II error. The probability of type-II error is beta. There is a statistical tradeoff with decreasing alpha producing increasing beta. Even though an argument could be made to adjust the alpha threshold based on the clinical implications of different error types, this is rarely done. Another important but often overlooked factor is the effect size, or delta, of an association. Associations with larger effect sizes require fewer observations to generate statistical precision and are less vulnerable to type-II error. While many factors can influence whether an association is reproducible, much recent attention has been paid to the level at which statistical tests are evaluated, which is referred to as an alpha threshold or p-value threshold [ 11 , 14 ]. Traditionally, a value of 0.05 has been used to reject a null hypothesis, although this value is not evidence based or supported by statistical logic [ 5 , 6 , 16 ]. Recently, several proposals have been made to address this issue including one to change the threshold at which hypothesis tests are evaluated from 0.05 to 0.005, which has gained considerable attention [ 1 , 11 ].
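The tradeoff between alpha and beta described above can be illustrated with a minimal sketch. It assumes a two-sided, two-sample z-test with equal arms and a known standard deviation; the function name `power_two_sample` and the example values (an effect size of 0.5 standard deviations and 64 patients per arm) are hypothetical and are not taken from any study reviewed here.

```python
from statistics import NormalDist


def power_two_sample(delta, sd, n_per_arm, alpha):
    """Approximate power of a two-sided, two-sample z-test with equal arms."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Distance between the group means in standard-error units
    shift = delta / (sd * (2 / n_per_arm) ** 0.5)
    # The negligible chance of rejecting in the wrong direction is ignored
    return 1 - z.cdf(z_crit - shift)


# Hypothetical scenario: same effect size and sample size, two alpha thresholds
for alpha in (0.05, 0.005):
    power = power_two_sample(delta=0.5, sd=1.0, n_per_arm=64, alpha=alpha)
    print(f"alpha={alpha}: power={power:.2f}, beta={1 - power:.2f}")
```

With the sample size and effect size held fixed, the lower threshold yields lower power and therefore a higher beta, which is the tradeoff the paragraph describes.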
Therefore, we sought to describe the use of alpha threshold values in two orthopaedic journals by asking (1) How frequently are alpha threshold values reported? (2) How frequently are power calculations reported? (3) How frequently are p values between 0.005 and 0.05 reported for the main hypothesis? (4) Are the p values less than 0.005 associated with study characteristics such as design and reporting power calculations?
Materials and Methods
Starting from August 2017, we obtained the 100 most recently published original clinical research articles from a leading general interest orthopaedic journal and a leading subspecialty orthopaedic journal: The Journal of Bone and Joint Surgery and the American Journal of Sports Medicine, respectively. Any nonresearch articles such as editorials and commentaries were excluded as were basic science articles and reviews. Of the 200 total articles (100 from each journal), 23 were randomized controlled trials, 141 were cohort studies or case series, 31 were database studies, and five were classified as other (Table 1).
The articles were reviewed and coded for study characteristics, including study design, study type, sample size, and the type of main exposure and outcomes used. Study design was classified as experimental if the investigator selected the exposure. In all cases, this was a randomized controlled trial (RCT). We classified the study design as a cohort or case series if the authors had access to original data collected for the study purpose such that patients were included as a part of the study reported. If authors did not have access to the original data and patients were recruited for purposes other than the study, we classified it as a database study.
For studies that did not specify a primary hypothesis, we selected a main hypothesis that was the most consistent with the title and abstract. The term hypothesis is used more generally here to refer to any research question tested using a hypothesis framework including questions that were explored retrospectively. Characteristics of alpha-threshold reporting include whether a specific value was stated, whether multiple hypotheses were tested, and whether alpha threshold values were adjusted. We recorded characteristics of power calculations including timing of the calculation and whether alpha and beta were specified. The p value associated with the main hypothesis and the lowest p value for each study were then recorded. P values were grouped as greater than 0.05, between 0.05 and 0.005, and less than 0.005 based on the tradition of using a threshold of 0.05 and the recent proposal of using 0.005.
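The grouping of p values described above could be sketched as follows. The function name `p_value_group` and the example values are hypothetical, and the handling of values exactly equal to a threshold is an assumption, because the text does not say to which group a p value of exactly 0.005 or 0.05 was assigned.

```python
from collections import Counter


def p_value_group(p, conventional=0.05, proposed=0.005):
    """Assign a p value to one of the three groups described in the text."""
    if p is None:
        return "not specified"
    # Assumption: values exactly at a threshold fall in the middle group
    if p < proposed:
        return f"less than {proposed}"
    if p <= conventional:
        return f"between {proposed} and {conventional}"
    return f"greater than {conventional}"


# Hypothetical p values for illustration only; not data from the reviewed articles
example_p_values = [0.001, 0.03, 0.2, None, 0.004, 0.049]
counts = Counter(p_value_group(p) for p in example_p_values)
for group, n in counts.items():
    print(f"{group}: {n}")
```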
Associations between study characteristics and the p value for the main hypothesis were assessed using proportions. No formal hypothesis testing was planned, and we did not perform an a priori power calculation. However, we understand some readers may be interested in p values even understanding their limitations. Given this, we used the Fisher’s exact test to calculate p values for the associations between study characteristics and the main hypothesis p value group. Given the exploratory nature of this, we are not declaring an alpha threshold nor are we adjusting for multiple hypothesis testing [ 21 ]. As no human research subjects were involved in this study, we did not seek institutional review board approval. There was no external funding source.
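For readers unfamiliar with the Fisher's exact test mentioned above, a minimal sketch follows. It is not the analysis code used for this study. It assumes a 2x2 table, whereas the comparisons here may have involved more than two p value groups, and the helper name `fisher_exact_two_sided` is hypothetical. The example table collapses the reported counts (65 of 130 versus 6 of 23 with a main hypothesis p value less than 0.005) into two columns purely for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Sum over all tables that are no more likely than the observed one
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))


# Rows: study design; columns: p < 0.005 versus not (illustrative collapse)
p = fisher_exact_two_sided(65, 65, 6, 17)
print(f"Fisher's exact p value: {p:.3f}")
```

The small tolerance in the comparison guards against floating-point ties between equally likely tables; standard statistical packages handle this in a similar spirit.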
Of the 166 articles (83%) that reported an alpha threshold, 99% (164) used 0.05. Few articles (15%, 30 of 200) specified a primary hypothesis though most (93%, 186 of 200) tested multiple hypotheses (Table 2). When an alpha threshold adjustment was made, it was often not clear whether the threshold reported was only applied to post-hoc tests following an omnibus test or whether it was used to adjust a family-wise error rate for the entire study, the combined rate of type-I errors across multiple tests.
Only 21% (42 of 200) of articles performed a power calculation, of which 69% (29 of 42) were performed a priori (Table 3). Of the 42 articles reporting a power calculation, most solved for sample size and were a priori although several were post hoc and solved for power. The alpha and beta values used in the calculation were frequently but not always specified. Of the 27 studies that specified a beta value a priori, 25 sought 80% power and two sought 90% power.
The p value associated with the main hypothesis was between 0.005 and 0.05 for 34% (67 of 200) of articles (Table 4). The actual value was not specified in 8% (16 of 200) of articles. Additionally, the smallest p value was between 0.005 and 0.05 for 20% (39 of 200) of articles.
We found that p values less than 0.005 for the main hypothesis were more common in cohort studies (65 of 130, 50%) than in RCTs (6 of 23, 26%) and in studies not mentioning power (73 of 142, 51%) than in those mentioning power (15 of 42, 36%) (Table 5). Additionally, studies with a categorical exposure (29 of 161, 18%) were more likely to have a p value greater than 0.05 than studies with a continuous exposure (0 of 23, 0%).
Clinical research in orthopaedics typically reports the presence of an association after rejecting a null hypothesis of no association using an arbitrary alpha threshold of 0.05, which may lead to difficulties reproducing some findings. A proposal to lower the alpha threshold to 0.005 is gaining attention broadly across medicine, although it is unknown how alpha thresholds are used in orthopaedics or how reported p values are distributed. In our review of recent clinical research from two main orthopaedic journals, we found that few studies specified a primary hypothesis, although approximately one in three reported a p value for the main hypothesis between 0.005 and 0.05. This suggests that lowering the alpha threshold at which the null hypothesis is rejected would result in up to one-third of orthopaedic research findings changing from having supported the existence of an association to not supporting it.
This study had several limitations. First, we emphasize that it is a descriptive study. Without knowledge of the truth behind the associations of individual articles studied, it is not possible to assess how type-II error would increase with this change in interpretation. Although much work related to this tradeoff has emerged from the statistical methods underlying genetic research [ 12 ], there is much left to learn about how to apply such knowledge in orthopaedics. So although not assessed here, another important factor to consider with any proposal to decrease the alpha threshold is that it would simultaneously increase type-II error if the sample size was held constant. Also relevant to the use of alpha thresholds in hypothesis testing, effect size was not explored here. Another limitation was the need to assign a main hypothesis to the many articles that lack a formally stated primary hypothesis. It is possible that for some articles, the hypothesis with the lowest p value was given the most attention despite having not initially been the hypothesis of greatest interest. However, this underscores how lowering the alpha threshold may lead researchers to more heavily focus on statistical rather than clinical results. It is also worth mentioning again that the term hypothesis is used less formally in this manuscript, with few of the articles reviewed having actually prespecified a prospective hypothesis to test.
The findings are also limited by the fact that they came from only two journals. It is possible that the rate at which alpha thresholds and power calculations are reported in other journals may be lower. It is also possible that articles in those journals may report a different distribution of p values. However, focusing on work published in these leading journals can help estimate the impact the proposal to change the alpha threshold would have even on the most impactful work. Further and related to the variability of research across orthopaedics, it should be mentioned that formal hypothesis testing with the rejection of a null hypothesis is likely not necessary across all or even much of the clinical research performed in the field.
Although most articles (164 of 200) reported an alpha threshold of 0.05, only 15% (30 of 200) of articles reported a primary hypothesis despite more than 90% (186 of 200) reporting multiple hypothesis tests and fewer than 10% (14 of 200) adjusting their study-wise alpha threshold. Given this, simply altering the alpha threshold would not produce the fully intended effect because most articles would still have a family-wise error rate greater than the proposed alpha threshold of 0.005. Additionally, it is possible that many articles tested hypotheses that are not reported, which would only further contribute to an inflated type-I error rate. This concern is less relevant to studies with preregistered primary hypotheses, which highlights the importance of prospective trial registration [ 15 , 17 , 20 , 24 ]. With a prespecified primary hypothesis, the type-I error rate is controlled by eliminating the possibility that multiple hypotheses were tested and either not reported or that the primary hypothesis was defined based on the result of having a p value below the alpha threshold.
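To illustrate why the family-wise error rate can exceed the per-test alpha threshold, the following sketch uses the standard expression 1 - (1 - alpha)^m. It rests on simplifying assumptions that may not hold for any given article: the m tests are independent and every null hypothesis is true. The function names and the choice of 10 tests are hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one type-I error across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(target_fwer, n_tests):
    """Per-test threshold that keeps the family-wise rate at or below the target."""
    return target_fwer / n_tests


# Hypothetical article that runs 10 tests at each per-test threshold
for alpha in (0.05, 0.005):
    fwer = family_wise_error_rate(alpha, n_tests=10)
    print(f"per-test alpha={alpha}: family-wise error rate={fwer:.3f}")

print(f"Bonferroni per-test alpha for 10 tests: {bonferroni_alpha(0.005, 10)}")
```

Under these assumptions, even a per-test threshold of 0.005 gives a family-wise rate well above 0.005 once several tests are performed, unless the threshold is adjusted.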
Only about one in five studies (42 of 200) reviewed provided a power or sample-size calculation, and even among these, only two-thirds were performed a priori. This is important because without power and sample size calculations, it is difficult to control type-II error. Related to this is the tradeoff between type-I and type-II errors when considering a proposal specific to type-I error. For prospectively conducted research with a priori sample size calculations, it is possible to reduce type-I error without increasing type-II error, but this requires increased sample sizes. As an example, and understanding that the magnitude of change is modified by multiple factors, had the clavicle fixation RCT from Canada used an alpha threshold value of 0.005 instead of 0.05, 102 patients per arm would have been needed versus 60, a 70% increase [ 4 ].
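The size of this increase can be approximated with a short sketch. It assumes a two-sided test, a normal approximation in which the required sample size is proportional to (z_{1-alpha/2} + z_{power})^2, and 80% power. The power value is an assumption chosen for illustration, as the power used in the cited trial is not restated here, and the function name `sample_size_inflation` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_inflation(alpha_old, alpha_new, power):
    """Ratio of required sample sizes when only the alpha threshold changes
    (normal approximation, two-sided test, fixed effect size and power)."""
    z = NormalDist().inv_cdf
    z_power = z(power)
    old = (z(1 - alpha_old / 2) + z_power) ** 2
    new = (z(1 - alpha_new / 2) + z_power) ** 2
    return new / old


ratio = sample_size_inflation(0.05, 0.005, power=0.80)  # assumed 80% power
n_new = ceil(60 * ratio)
print(f"inflation factor: {ratio:.2f}; patients per arm: 60 -> {n_new}")
```

Under these assumptions, the inflation factor is approximately 1.70, which is consistent with the increase from 60 to 102 patients per arm described above.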
In the studies reviewed, RCTs were less likely to report a p value of less than 0.005 compared with cohort and database studies. A recent review of RCTs reported in major medical journals found that 30% of those with a p value of less than 0.05 did not have a p value less than 0.005 [ 23 ]. This is important because, as mentioned, lowering the alpha threshold would lead to increased resources to power future studies to a new threshold of 0.005 [ 9 ]. Additionally, this suggests that lower-quality studies, such as large database studies, will be emphasized initially.
To avoid differentially affecting higher-quality studies or increasing the burden placed on researchers doing hypothesis-guided prospective work, one possible solution would be to use alpha thresholds specific to study characteristics. For instance, the primary hypothesis from an appropriately registered RCT may still be best evaluated using an alpha threshold of 0.05 while a hypothesis studied using a large administrative database may be best evaluated at a threshold that is a fraction of 0.005 [ 2 ]. While this approach would be accompanied by its own limitations, it benefits in theory from drawing on Bayesian-type principles of pretest probability while remaining in the frequentist framework [ 3 ]. If study characteristic-specific alpha thresholds could be agreed upon, this could reduce the burdens and limitations associated with study-specific alpha thresholds that may be too difficult to agree upon [ 14 ]. An additional practical measure that might also help would be to focus on registering studies and evaluating only specified primary hypothesis tests at a given alpha value [ 15 ].
We found that not only do one-third of clinical orthopaedic research articles report a p value between 0.005 and 0.05 for their main hypothesis but also that those articles with a p value less than 0.005 are more likely to be database studies and without power calculations. Although a lower alpha threshold may theoretically increase the reproducibility of research findings across orthopaedics, this would preferentially select findings from lower-quality studies or increase the burden on higher-quality studies. A more nuanced approach could be to consider alpha thresholds specific to study characteristics. For example, RCTs with a prespecified primary hypothesis may still be best evaluated at 0.05 while database studies with an abundance of statistical tests may be best evaluated at a threshold even below 0.005.
Each author certifies that he has no commercial associations (eg, consultancies, stock ownership, equity interest, patent/licensing arrangements, etc) that might pose a conflict of interest in connection with the submitted article.
All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.
Clinical Orthopaedics and Related Research® neither advocates nor endorses the use of any treatment, drug, or device. Readers are encouraged to always seek additional information, including FDA approval status, of any drug or device before clinical use.
Each author certifies that his institution waived approval for the reporting of this investigation and that all investigations were conducted in conformity with ethical principles of research.
1. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, Cesarini D, Chambers CD, Clyde M, Cook TD, De Boeck P, Dienes Z, Dreber A, Easwaran K, Efferson C, Fehr E, Fidler F, Field AP, Forster M, George EI, Gonzalez R, Goodman S, Green E, Green DP, Greenwald AG, Hadfield JD, Hedges LV, Held L, Ho TH, Hoijtink H, Hruschka DJ, Imai K, Imbens G, Ioannidis JPA, Jeon M, Jones JH, Kirchler M, Laibson D, List J, Little R, Lupia A, Machery E, Maxwell SE, McCarthy M, Moore DA, Morgan SL, Munafó M, Nakagawa S, Nyhan B, Parker TH, Pericchi L, Perugini M, Rouder J, Rousseau J, Savalei V, Schönbrodt FD, Sellke T, Sinclair B, Tingley D, Van Zandt T, Vazire S, Watts DJ, Winship C, Wolpert RL, Xie Y, Young C, Zinman J, Johnson VE. Redefine statistical significance. Nature Human Behaviour. 2018;2:6-10.
2. Berger VW. On the generation and ownership of alpha in medical studies. Control Clin Trials. 2004;25:613-619.