Null hypothesis significance testing (NHST) has been a commonly used paradigm in biomedical and epidemiological research for almost a century, applied to estimate and test the effects of new interventions and diagnostic tests, and other epidemiological associations such as risk factors for disease.1 The cornerstone of this paradigm is the p-value and its corresponding, frequently used threshold for rejection of the null hypothesis, or alpha level, of p<0.05.2
The concepts of NHST, p-values and alpha levels have been criticised ever since they were first introduced, including in many noteworthy journals over the last 20 years.3 Nevertheless, they have been adopted as a standard and are used frequently in research and education.4 However, their sometimes indiscriminate use has led to misunderstanding of the principles underlying them, as well as instances of misuse and misinterpretation.5 These challenges have contributed to the replication crisis in which science currently finds itself.6 In biomedical research, sound statistical reasoning is paramount for practising evidence-based medicine.7 However, challenges in interpreting and using evidence from research can be aggravated by the problems attributed to p-values and NHST.
P-values are frequentist conditional probabilities, defined over hypothetical repetitions of the same study conditions.8 9 A frequent misinterpretation of p-values is that they represent the probability that the null hypothesis is true. In fact, p-values give the probability of observing the data, or data more extreme, if the null hypothesis were true, rather than what we actually want to know: the probability that the null hypothesis is true, given the data.9 P-values can be over-trusted when we rely solely on them to inform us of treatment effects.5 Because p-values are so easily influenced by sample size, choice of statistical model and other factors, it is not advisable to base clinical decisions on a single p-value.10
In NHST, the choice of p<0.05 for rejection of the null hypothesis is essentially arbitrary. It made some sense when Fisher11 first proposed it, since the volume of hypotheses being tested in the early 20th century was small and screening out 5% false positives was reasonable. Nowadays, however, the number of hypotheses being tested each day is unfathomable, and thus p<0.05 has a low predictive value for the correct rejection of the null hypothesis.12 This false dichotomisation of what is essentially a continuous metric can lead to statistical errors. False positive risk is the probability that, when a researcher decides a statistical test is positive (or the null hypothesis is rejected), there is no real effect, or the results occurred by chance.9 Additionally, a large enough sample size can produce a statistically significant result for a negligible effect, leading to the same error.13 Too many false positive results eventually discredit and impede science.1 False findings may even represent the majority of published research.14
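To illustrate how a fixed alpha of 0.05 can translate into a high false positive risk, the short sketch below works through the standard calculation under purely illustrative assumptions (a 10% prior proportion of real effects and 80% power); these numbers are ours and are not taken from the cited references.

```python
# Worked example of false positive risk under illustrative assumptions
# (the numbers are ours, not taken from the cited references).
prior_true_effect = 0.10   # assumed: 10% of tested hypotheses reflect a real effect
power = 0.80               # assumed: 80% power to detect a real effect
alpha = 0.05               # conventional significance threshold

true_positives = prior_true_effect * power            # 0.10 * 0.80 = 0.080
false_positives = (1 - prior_true_effect) * alpha     # 0.90 * 0.05 = 0.045

false_positive_risk = false_positives / (true_positives + false_positives)
print(f"False positive risk: {false_positive_risk:.0%}")  # about 36%
```

Under these assumed conditions, roughly a third of 'significant' results would be false positives, despite the nominal 5% alpha level.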
There is a general misinterpretation that statistical significance is the same as clinical (or practical) importance. A p-value, or statistical significance, does not measure the size of an effect, nor does it provide adequate evidence of an effect on its own.5 In clinical trials, when the trial protocol is designed, the hypotheses are formulated and sample sizes are estimated using the minimum clinically important difference (MCID). However, when the analysis is conducted using NHST, the observed difference is statistically compared with zero rather than with the MCID.15 The clinical importance of the observed difference is left to the interpretation of the reader. In biomedical research, where effect sizes are often not known a priori, statistical analysis has been almost purely about statistical significance, with little regard for error rates. While this has led to campaigns to augment the p-value with other metrics such as 95% confidence intervals (CIs), CIs themselves also suffer from misinterpretation and are sometimes erroneously used as hypothesis tests based on the 5% cut-off.12 In their defence, however, reporting of CIs around the effect estimate does allow readers to judge the clinical significance of the effect against what they consider an important difference, even if they do not agree with the MCID stated by the authors of the study.
Publication of trials showing results that are statistically significant but not necessarily clinically important is prevalent and could partly explain the high false positive rates in superiority trials.15 On the other hand, findings that are not statistically significant should not necessarily be interpreted as showing no effect or equivalence.16 In biomedical science, where the end goal should be to improve healthcare and patient outcomes, both of these errors can have serious consequences.
The ‘publish or perish’ environment in which academics work perpetuates the misuse of statistics and can lead to publication bias,17 whereby journals selectively publish studies showing small p-values, even in the presence of small effect sizes or minor clinical differences. Reward is largely driven by publication output, and researchers sometimes choose to do studies that they know will lead to ‘statistically significant’ results rather than studies that answer interesting, novel or urgent questions.18 This can lead to ‘p-hacking’ and other conscious or unconscious practices that falsely increase the likelihood of obtaining p<0.05.19 A further factor that could exacerbate these problems is ‘modern celebrity culture’, which speaks to pressure from academic institutions seeking to increase their public profile for fundraising, as well as the fame and promotion prospects that prominent publications bring to the researcher.20
One study, which examined the rate of false positive reporting in a sample of only 50 published randomised controlled trials, found that 6% of trials had statistical significance but no clinical significance (false positives), 74% were statistically significant but not necessarily clinically significant, and only 20% were both statistically and clinically significant.15 This study points the way to conducting a larger investigation into this phenomenon, and to assessing the factors associated with the discordance.
Biomedical research now finds itself at a crossroads following a landmark statement paper by the American Statistical Association,5 which detailed the shortcomings of p-values and NHST but did not propose a solution. There is a lack of consensus among statisticians about the best way forward.21 The volume of information in the form of scientific publications and opinion pieces, together with the conflicting viewpoints on the way forward, is confusing for statisticians, let alone for non-statistical clinical researchers or consumers of the research literature. The potential solutions proposed often involve a steep learning curve,13 and considerable resistance to reform may also emerge in response to them.22
Overall, improved reporting of research that incorporates alternative metrics as well as p-values in data interpretation is required to advance science, especially if the goal of research is improved healthcare and patient outcomes. Jakobsen et al have proposed a useful five-step guideline for evaluating effects in clinical trials, which includes CIs and p-values, Bayes factors, adjustment for early termination of the trial, adjustment for multiple tests and assessment of the clinical significance of the trial.23 Whatever improved or adjusted metrics are used to quantify scientific evidence from research, they should not replace good study design and meticulous reporting. Statistical testing still has its place alongside other metrics and methods, but it should be used objectively, with a ‘truth-seeking’ approach.20
The challenges inherent in NHST, including the misinterpretation and misuse of p-values, need to be addressed in biomedical research. Although these problems are not unique to clinical trials, they can best be studied in clinical trials because the guidelines applied when publishing trials require more rigour in reporting the sample size estimation and the assumptions underlying it, including the MCID. The extent of the agreement between statistical and clinical significance in clinical trials warrants further investigation using rigorous methodology. Identifying the factors associated with discordant findings will also help us understand how to prevent them from occurring. This study sets out to investigate the extent of the disparity between statistically and clinically significant results in published clinical trials, and to assess the factors associated with this disparity.
Methods
Study design
A methodological study of published randomised controlled trials is planned. The PubMed Medline database will be searched on 15 May 2023 using the terms detailed in table 1. The results will be filtered by year (2018–2022) to produce a subset of searches for each of the 5 years, sorted by relevance. The first 200 studies yielded (title and abstract) in each year stratum will be screened for inclusion, based on the assumption that half of the studies screened will meet the inclusion criteria. If further studies are required, the subsequent studies on the list will be screened. The sampled study abstracts will be reviewed independently for eligibility against the inclusion and exclusion criteria by two authors, with consensus reached by discussion. Screening results will be entered and managed using an Excel spreadsheet.
Inclusion criteria
Two-arm, phase III, superiority randomised controlled trials (RCTs) of health interventions in humans, published between 2018 and 2022, will be included in the study. Articles reporting the primary outcome of the trial will be included.
Exclusion criteria
Protocol publications, pharmacokinetic studies, pilot trials, systematic reviews and duplicate publications will be excluded.
Sample size
Based on an estimated small percentage of events (disparate studies) of 10%, informed by a pilot study conducted by the researchers, a sample size of 500 studies (100 per year) will give a 95% CI around a 10% estimate with a precision (half-width) of 2.75%. We considered an effect size (OR) of 1.5 and above to be clinically meaningful for a binary covariate of interest. However, it is unlikely that we will achieve statistical significance of p<0.05 at this effect size with the outcome probability set to 10%, an R² between covariates of 20% and a 20% prevalence of the independent variable of interest. A sample size of 500 studies yields only 22% power at a 0.05 significance level to detect such a difference. We will have >80% power to detect ORs of 2.5 or higher under the same assumptions. However, since placing too much importance on covariates with p<0.05 can detract from other important factors associated with the outcome,24 we will concentrate on interpreting the size of the effects and their CIs rather than p-values in this analysis.
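As a rough check of the stated precision, the sketch below computes the 95% CI half-width for a 10% proportion in a sample of 500 studies. It is an illustration under assumptions (50 disparate studies out of 500; the protocol does not state which CI method was used), not the authors' actual calculation.

```python
# Rough check of the stated CI precision (a sketch under assumptions,
# not the authors' actual calculation).
from statsmodels.stats.proportion import proportion_confint

count, nobs = 50, 500  # assumed: 10% of 500 sampled trials are disparate

for method in ("normal", "beta"):  # Wald approximation and Clopper-Pearson exact
    lower, upper = proportion_confint(count, nobs, alpha=0.05, method=method)
    half_width = (upper - lower) / 2
    print(f"{method:>6}: 95% CI ({lower:.3f}, {upper:.3f}), half-width {half_width:.2%}")
```

The Wald half-width is about 2.6% and the exact interval is slightly wider, in line with the approximately 2.75% precision stated above.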
Data extraction and synthesis
For the included eligible studies, full-text files will be obtained. Data extraction will be conducted by two review authors, independently and in duplicate. Discrepancies will be resolved by discussion between the two authors. We will extract and enter data into a pre-piloted REDCap (Research Electronic Data Capture) database. REDCap is an electronic data capture tool hosted at Stellenbosch University.25
The following data will be extracted from the selected published trials and their protocols, if available:
Bibliometric information: authors, year of publication, journal, impact factor of the journal.
Funding source of the trial.
Type of disease condition or population under study.
Trial design (parallel or cross-over).
Unit of randomisation.
Intervention and control treatments.
Primary outcome and its measurement scale.
Whether the trial protocol is published and/or accessible from websites/registries.
MCID used to calculate sample size (from the protocol or publication).
What the MCID represents to the authors or source of the MCID.
Sample size estimated and final sample size used.
Level of significance used.
Effect estimate.
CI of the effect estimate.
Number of events per treatment group/summary statistics of outcome measures per group.
P-value of the primary hypothesis test.
Conclusion of the researchers.
Methodological quality of the study: type of randomisation, evidence of adequate allocation concealment, blinding, sequence generation, incomplete outcome data, selective outcome reporting.
Clinical significance of the results.26
Classification of study results as either disparate or not (table 2).
Table 2: Classification system of disparity based on statistical and clinical significance of the study primary outcome result
The primary outcome measure reported in the study will be used, and its point estimate and 95% CI will be extracted. Where more than one primary outcome is stated, the first mentioned primary outcome will be extracted.
To assess the discrepancy between statistical and clinical significance, the clinical importance of the results will first be categorised, using the MCID, the point estimate (effect size) and its CI, as definite, probable, possible or definitely not clinically important.26 Studies will be classified as of definite clinical importance when the MCID is smaller than the lower limit of the CI, probable clinical importance when the MCID lies between the lower CI limit and the effect estimate, possible clinical importance when the MCID lies between the effect estimate and the upper CI limit, and definitely not clinically important when the MCID is greater than the upper CI limit of the effect estimate.26 The MCID, expressed as an absolute difference between groups (percentages in the case of binary variables, mean differences in the case of measured outcome variables), will be reported as intervention minus control. For ratio measures of comparison (odds ratio, relative risk, hazard ratio), the ratio will always be expressed as intervention/control. Where studies report the MCID as a different measure of effect from that used for the primary outcome, the MCID will be converted (if possible) to the same type of effect size measure as the primary outcome. For example, if the MCID is expressed as a risk difference or percentage difference in outcome between treatment groups, but the primary outcome is reported as a relative risk, the MCID will then be converted to a relative risk for ease of comparison. In studies that do not report an MCID, where we cannot convert the MCID to the same measurement scale as the primary outcome, or where the point estimate or CI is not available, the clinical importance will be classified as indeterminate.
Studies where there is statistical significance and at least possible clinical importance (including definite, probable and possible) will be classified as no disparity. Conversely, if the clinical importance is at least possible but no statistical significance was obtained, the study will be classified as disparity. Studies that are definitely not clinically important but statistically significant will be classified as disparity, whereas the same category of studies without statistical significance will be classified as no disparity, as shown in table 2. Where clinical importance is indeterminate, the discrepancy will also be classified as indeterminate. The combination of no statistical significance and definite clinical importance is not possible, since the MCID of a study whose 95% CI crosses the null will never be smaller than the lower limit of the CI.26
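To make the two classification rules concrete, the minimal sketch below encodes them in Python; the function and variable names are our own, and the planned analysis itself will be conducted in Stata. It assumes the MCID, point estimate and 95% CI limits are already on the same scale and oriented so that larger values favour the intervention.

```python
# Illustrative sketch of the clinical importance and disparity rules
# described above (table 2); names and example values are hypothetical.

def clinical_importance(mcid, lower, estimate, upper):
    """Classify clinical importance from the MCID, point estimate and 95% CI."""
    if mcid < lower:
        return "definite"        # MCID below the lower CI limit
    if mcid <= estimate:
        return "probable"        # MCID between lower CI limit and point estimate
    if mcid <= upper:
        return "possible"        # MCID between point estimate and upper CI limit
    return "definitely not"      # MCID above the upper CI limit

def disparity(importance, statistically_significant):
    """Combine clinical importance and statistical significance (table 2)."""
    if importance == "indeterminate":
        return "indeterminate"
    at_least_possible = importance in ("definite", "probable", "possible")
    if statistically_significant:
        return "no disparity" if at_least_possible else "disparity"
    return "disparity" if at_least_possible else "no disparity"

# Hypothetical example: MCID of 0.05 (absolute risk difference), observed
# difference 0.08 with 95% CI 0.02 to 0.14, statistically significant.
imp = clinical_importance(mcid=0.05, lower=0.02, estimate=0.08, upper=0.14)
print(imp, "->", disparity(imp, statistically_significant=True))  # probable -> no disparity
```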
Statistical analysis
The extracted data will be used to estimate, for all eligible studies or subgroups of studies (with the study as the unit of analysis), the proportion of studies that did not report MCIDs and, among those that published useable MCIDs, the proportion with a discrepancy between clinical and statistical significance, overall and stratified by type of discrepancy (false positive or false negative).
Factors associated with discrepant results will be assessed using logistic regression analysis at the study level. Studies where disparity is indeterminate will be excluded from this analysis. The factors considered as independent variables are those found to be associated with trial fragility in a previous systematic review, namely journal impact factor, sample size, effect size, p-value and methodological quality of the study, such as allocation concealment.27 Ioannidis14 postulated that smaller studies, those with smaller effect sizes, and those with greater conflicts of interest or financial prejudices are most likely to generate untrue findings. We will also consider funding source and the measurement scale of the primary outcome as factors. Crude ORs and 95% CIs will be estimated for each independent variable initially. Variables with clinically important effects will be considered for a multivariable model. We will assess collinearity using the variance inflation factor, and model goodness-of-fit by examining the residuals. The results of the final multivariable model will be reported as adjusted ORs and 95% CIs for each factor. All analyses will be performed using Stata/MP V.17.0.
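For readers less familiar with this workflow, the sketch below outlines the crude and adjusted OR estimation and the collinearity check in Python (statsmodels). It is an illustration only: the actual analysis will use Stata/MP 17, and the file and column names are hypothetical placeholders for the extracted variables.

```python
# Hedged illustration of the planned regression workflow; the real analysis
# will use Stata/MP 17, and the column names are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("extracted_trials.csv")             # assumed data export
df = df[df["disparity"] != "indeterminate"].copy()   # exclude indeterminate studies
df["disparate"] = (df["disparity"] == "disparity").astype(int)

# Crude (unadjusted) ORs: one univariable logistic model per numeric factor.
# Categorical factors such as funding source would enter the formula via C(...).
for factor in ["impact_factor", "sample_size", "effect_size"]:
    crude = smf.logit(f"disparate ~ {factor}", data=df).fit(disp=False)
    or_ = np.exp(crude.params[factor])
    ci = np.exp(crude.conf_int().loc[factor])
    print(f"{factor}: crude OR {or_:.2f} (95% CI {ci.iloc[0]:.2f} to {ci.iloc[1]:.2f})")

# Adjusted ORs from a multivariable model of the retained factors.
adjusted = smf.logit("disparate ~ impact_factor + sample_size + effect_size",
                     data=df).fit(disp=False)
print(np.exp(adjusted.params), np.exp(adjusted.conf_int()), sep="\n")

# Collinearity check: variance inflation factor for each covariate
# (column 0 of the design matrix is the intercept, so it is skipped).
X = adjusted.model.exog
print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
```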
Patient and public involvement
Patients and the public were not involved in the development of the research question, design, conduct or dissemination plans of this study.