Electronic health records (EHR) contain a wide range of patient information, serving as a valuable resource for scientific health research. Over the past decades, EHR data have been widely used for generating clinical-based evidence across many disease domains and research fields [[1], [2], [3], [4]]. One common use of EHR data is the identification of novel risk factors through association studies [5], with potential applications in various fields including drug repurposing [6,7], pharmacovigilance [8,9], and pharmacoepidemiology [10].
However, there are challenges associated with using EHR data for association studies. The first major challenge is the presence of measurement error in commonly used algorithm-derived phenotypes [11]. Algorithm-derived phenotypes are determined by automated computational methods (Fig. 1), including expert-designed rule-based algorithms [[12], [13], [14], [15]] and recently emerged machine-learning approaches [[16], [17], [18], [19]]. Despite significant efforts in developing high-performance phenotyping algorithms [[18], [19], [20], [21]], misclassification errors can still occur in the algorithm-derived phenotypes, which are referred to as surrogates or surrogate outcomes hereafter. The misclassification errors in the surrogates may result in systematic bias [[22], [23], [24]], inflation of the type I error [25], and reduction of statistical power [26]. Consequently, such limitations would reduce the reproducibility of EHR-based clinical findings [27]. To address the misclassification errors, various methods have been developed. For example, Magder and Hughes (1997) [28] proposed a method to correct association estimates using estimated misclassification rates (i.e., sensitivity and specificity) from a validation set. Edward et al. (2013) [29] developed a method using multiple imputations to account for misclassification when validation data are available. However, these methods often depend on the correct specification of the misclassification rate, which can be difficult with the complex nature of EHR data [[28], [29], [30], [31], [32]].
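To make the bias mechanism concrete, the following minimal sketch shows how nondifferential misclassification of a binary outcome attenuates an odds ratio, and how a simple back-calculation recovers it when sensitivity and specificity are known exactly. It is only an illustration under stated assumptions: the cell counts, the sensitivity of 0.8, and the specificity of 0.9 are hypothetical, the function names are invented for this example, and the correction shown is a basic 2x2-table adjustment rather than the regression-based methods of the cited works.

```python
def observed_counts(true_cases, true_noncases, sens, spec):
    """Expected surrogate-positive and surrogate-negative counts in one exposure group."""
    pos = sens * true_cases + (1 - spec) * true_noncases
    neg = (1 - sens) * true_cases + spec * true_noncases
    return pos, neg


def corrected_counts(obs_pos, obs_neg, sens, spec):
    """Back-calculate true case and non-case counts from surrogate counts."""
    n = obs_pos + obs_neg
    # Solve obs_pos = sens * cases + (1 - spec) * (n - cases) for cases.
    cases = (obs_pos - (1 - spec) * n) / (sens + spec - 1)
    return cases, n - cases


def odds_ratio(exp_cases, exp_noncases, unexp_cases, unexp_noncases):
    return (exp_cases * unexp_noncases) / (exp_noncases * unexp_cases)


# Hypothetical true counts (assumed for illustration only).
exposed = (200, 800)      # (cases, non-cases) among exposed patients
unexposed = (100, 900)    # (cases, non-cases) among unexposed patients
sens, spec = 0.8, 0.9     # assumed, identical in both groups (nondifferential)

true_or = odds_ratio(*exposed, *unexposed)

obs_exposed = observed_counts(*exposed, sens, spec)
obs_unexposed = observed_counts(*unexposed, sens, spec)
observed_or = odds_ratio(*obs_exposed, *obs_unexposed)

fix_exposed = corrected_counts(*obs_exposed, sens, spec)
fix_unexposed = corrected_counts(*obs_unexposed, sens, spec)
corrected_or = odds_ratio(*fix_exposed, *fix_unexposed)

print(f"true OR      = {true_or:.3f}")
print(f"surrogate OR = {observed_or:.3f}")
print(f"corrected OR = {corrected_or:.3f}")
```

The correction succeeds here only because the sensitivity and specificity plugged in are exactly the ones that generated the surrogate counts; with misspecified rates the back-calculated odds ratio would itself be biased, which is the limitation noted above.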
Rather than using error-prone algorithm-derived phenotypes in association studies for risk factor identification, one can obtain the phenotypes through manual chart review (Fig. 1). During the manual chart review process, clinicians thoroughly assess each patient’s entire health record and assign a label indicating whether the patient has a specific phenotype based on their expert evaluation. Phenotypes verified through manual chart review are often considered the gold standard and provide unbiased estimates of the risk factors’ coefficients in association studies [[33], [34], [35], [36]]. However, the chart review process is time-consuming and expensive, so it is typically affordable only for a small subset of patients, referred to as the validation set. Association studies based on a small validation set alone often yield inefficient estimators of the coefficients. To fully utilize both algorithm-derived and chart-reviewed phenotypes, Tong et al. (2020) [37] proposed an augmented estimation method. Their method integrates information from both chart-reviewed and algorithm-derived phenotypes without requiring correct specification of the misclassification rate of the phenotyping algorithm.
Although the augmented estimation method described above effectively combines the two types of phenotypes, another real-world challenge limits its practical application: manual chart review may fail to provide definitive answers when the phenotypes are difficult to determine accurately [38]. Reviewers may encounter unclear or missing information [39] or yield inconsistent results when reviewing clinical notes or diagnoses [40]. Consequently, in addition to the binary categories of “yes” and “no”, a third category (often labeled “undecided”, “uncertain”, or “maybe”) may emerge during manual chart reviews [[38], [39], [40]], as illustrated in Fig. 1. This “undecided” category is commonly observed in practice. A motivating example is an investigation of the impact of bias in computing phenotypes on drug signal detection for Alzheimer's disease and related dementias (ADRD) patients [41], in which a subset of 384 patients’ medical records was reviewed by domain experts and physicians to verify the presence of ADRD. Among them, 55.2% of the patients had indeterminate phenotypes and were labeled “undecided”. Discarding these “undecided” cases directly reduces the sample size of the validation set, which is already quite small. This reduction further leads to inefficiency and diminished statistical power for the corresponding estimator based on the validation set. Therefore, it is critical to develop a method that fully utilizes all clinician-labeled patients, including those whose phenotypes are difficult to determine.
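The cost of discarding the “undecided” records can be gauged with a back-of-the-envelope calculation. The sketch below uses the counts reported for the ADRD example (384 reviewed records, 212 of them labeled “undecided”) and assumes, purely for illustration, that standard errors scale as one over the square root of the sample size and that the retained records resemble the discarded ones. Neither assumption is guaranteed to hold in practice, so the resulting figure is a rough guide rather than a result from the study.

```python
import math

n_reviewed = 384    # chart-reviewed records in the ADRD example
n_undecided = 212   # records labeled "undecided"

n_kept = n_reviewed - n_undecided
share_undecided = n_undecided / n_reviewed

# Assumed 1/sqrt(n) scaling of the standard error (illustrative only).
se_inflation = math.sqrt(n_reviewed / n_kept)

print(f"records kept after discarding: {n_kept}")
print(f"share undecided: {100 * share_undecided:.1f}%")
print(f"approximate standard-error inflation: {se_inflation:.2f}x")
```

Under these assumptions, dropping the “undecided” records leaves fewer than half of the reviewed charts and widens confidence intervals by roughly one half.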
In addition to the trinary chart-reviewed phenotypes, the third key challenge in the real-world setting is the presence of rare diseases or events. Rare diseases such as systemic lupus erythematosus (SLE) [42], which has an incidence rate of 5.1 per 100,000 person-years [43], are commonly of interest in EHR-based association studies. The rarity of events induces a class imbalance in algorithm-derived phenotypes, which carries over to any randomly sampled validation set used for chart review. To address this, outcome-dependent sampling is a well-known strategy for enriching the cases in the validation set [[44], [45], [46], [47]]. An augmented estimation method has been developed to handle the rare event scenario using both algorithm-derived and chart-reviewed phenotypes; however, this method cannot accommodate trinary chart-reviewed phenotypes [47]. In the motivating example of ADRD, excluding the 212 patients labeled as “undecided” by chart review would waste resources spent on both data collection and manual chart review. Therefore, a novel approach is needed that accommodates the indeterminate category in manual chart reviews, accounts for the rarity of events, integrates both algorithm-derived and chart-reviewed phenotypes, and generates reliable estimates of the risk factors’ coefficients.
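One common way to enrich cases in a validation set is to stratify the sampling on the algorithm-derived surrogate, since the true phenotype is unknown before chart review. The sketch below illustrates that generic idea with a hypothetical cohort of 10,000 patients, 1% of whom are flagged by the algorithm. The function name, the cohort, and the sample sizes are assumptions made for this example, and the code is not a description of the sampling design used in the cited works.

```python
import random


def surrogate_stratified_sample(surrogates, n_pos, n_neg, seed=0):
    """Sample patients for chart review, stratified by surrogate label.

    Returns a dict mapping each sampled patient index to its
    inverse-probability sampling weight.
    """
    rng = random.Random(seed)
    pos_ids = [i for i, s in enumerate(surrogates) if s == 1]
    neg_ids = [i for i, s in enumerate(surrogates) if s == 0]

    chosen_pos = rng.sample(pos_ids, n_pos)
    chosen_neg = rng.sample(neg_ids, n_neg)

    # Each sampled patient stands in for (stratum size / sample size) patients.
    weights = {i: len(pos_ids) / n_pos for i in chosen_pos}
    weights.update({i: len(neg_ids) / n_neg for i in chosen_neg})
    return weights


# Hypothetical cohort: 100 surrogate-positive and 9,900 surrogate-negative patients.
surrogates = [1] * 100 + [0] * 9900

weights = surrogate_stratified_sample(surrogates, n_pos=50, n_neg=50)
n_flagged = sum(surrogates[i] for i in weights)

print(f"validation set size: {len(weights)}")
print(f"surrogate-positive patients sampled: {n_flagged}")
print(f"sum of sampling weights: {sum(weights.values()):.0f}")
```

A simple random sample of 100 patients from this cohort would contain about one surrogate-positive patient on average, whereas the stratified design guarantees 50. The sampling weights record how many cohort members each reviewed patient represents, which any downstream estimator must account for because the validation set is no longer representative of the cohort.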
In this paper, we propose a trinary chart-reviewed phenotype integrated cost-effective augmented estimation method (TriCA). The key innovations of TriCA are the inclusion of patients labeled “undecided” in manual chart review and the use of a biased sampling strategy to construct an augmented estimator that optimally combines algorithm-derived and chart-reviewed phenotypes. By considering trinary chart-reviewed phenotypes rather than only binary categories, the proposed method addresses a scenario that aligns more closely with real-world settings. Additionally, by adopting the outcome-dependent sampling strategy, the method uses a more balanced validation set to develop robust and reliable estimators. This enriched validation set, containing a higher proportion of cases, allows for more efficient estimation and addresses the challenge of the limited availability of disease cases.
Statement of significance