Causal relationships are essential for guiding effective interventions, especially in high-stakes settings like healthcare. The large amount of individual patient data collected through Electronic Health Records (EHRs) and patient-generated sources provides a unique opportunity for learning causal relationships, such as causes of complications or side effects. However, a fundamental challenge to causal inference in healthcare is that different variables are collected for different patients, since what is measured depends on each patient’s health status. For example, an expensive test may be done only when a particular diagnosis is suspected, and some invasive brain monitoring is performed only in unconscious patients. Having different variables across patients is common across health data types (e.g., EHRs, Intensive Care Unit [ICU] data streams, free-living data captured by different patients) and is a natural consequence of how data is collected. Outside of clinical trials, where measurements are standardized, it would be expensive, inefficient, and unethical to require the same data to be collected for all patients.
Yet this creates major obstacles to using this data to learn causal relationships. Fig. 1 illustrates this situation and shows why measuring different variables poses a challenge. Each patient has only a subset of variables measured, leading to incomplete or incorrect relationships when inference is done separately for each patient (e.g., incorrectly inferring X2→X3 for patient 1 because X1 is latent). Even when variables are measured for a particular patient, correct relationships can be missed if the variables are not measured at the correct time scale due to variation in sampling granularity (e.g., X1→X2 may be missing for patient 6 because X1 causes X2 in 1-2 h, but X2 was not measured in that period). This raises the question: how can we learn causal models from multiple patients’ data when each patient has a different set of variables?
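The latent-variable problem described above can be made concrete with a small simulation. The following is a minimal sketch, not the paper's method: it assumes a linear Gaussian system in which X1 drives X2 at lag 1 and X3 at lag 2, with no edge between X2 and X3. The coefficients, noise levels, lags, and helper names (`pearson`, `partial_corr`) are all illustrative assumptions. When X1 is unobserved, lagged X2 appears strongly associated with X3. Conditioning on X1 removes that association.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(a, b, c):
    """First-order partial correlation of a and b given c."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))


# Hypothetical system (an assumption for illustration only):
#   X1[t-1] -> X2[t]  and  X1[t-2] -> X3[t]; there is no X2 -> X3 edge.
rng = random.Random(0)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.0] * n
x3 = [0.0] * n
for t in range(2, n):
    x2[t] = 0.9 * x1[t - 1] + rng.gauss(0, 0.3)
    x3[t] = 0.9 * x1[t - 2] + rng.gauss(0, 0.3)

# Align X2[t-1], X3[t], and the common cause X1[t-2] for t = 3 .. n-1.
lagged_x2 = x2[2 : n - 1]
current_x3 = x3[3:n]
common_cause = x1[1 : n - 2]

# With X1 latent, lagged X2 looks like a strong predictor of X3.
raw_association = pearson(lagged_x2, current_x3)
# With X1 observed, the apparent X2 -> X3 dependence largely disappears.
adjusted_association = partial_corr(lagged_x2, current_x3, common_cause)

print(f"X2[t-1] vs X3[t], X1 latent:   {raw_association:.2f}")
print(f"X2[t-1] vs X3[t], given X1:    {adjusted_association:.2f}")
```

A per-patient analysis that never observes X1 has no way to separate these two cases, which is why information from patients in whom X1 is measured is valuable.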
Heterogeneity in clinical data, such as differences in terminology or format, has traditionally been addressed with data harmonization, which aims to align multiple datasets, for example by mapping different EHR datasets to one data model (e.g., OMOP CDM) [1], [2]. This allows data to be integrated for analysis. While this is a critically important step, it focuses on standardizing data that is available, leaving open the problem of latent (unmeasured) variables. Causal inference methods that handle latent variables in time series data, such as time series fast causal inference (tsFCI) [3], assume that a variable is either latent or measured across an entire dataset, and cannot handle the case where variables are fully missing in some datasets but not in others. Additionally, while there are methods for causal inference across multiple datasets, they have not been extended to time series data. Further, these methods assume that conflicting relationships do not occur [4], [5] and do not adequately account for how statistical errors like missing time points (i.e., missing data within a measured variable) and low sample size affect our ability to learn correct causal relationships [6]. Existing methods can also generally handle only a few variables due to their computational complexity, which is unrealistic for healthcare settings that may have dozens or hundreds of variables with complex causal relationships among them.
To address this, we introduce Causal Model Combination for Time Series (CMC-TS), which learns causal models from time series datasets with partially overlapping variable groups. We use the partial overlap across patients as a feature, not a bug: by sharing causal information across datasets (i.e., patients), we can ultimately learn a single causal model from all datasets. Our approach is iterative: we seek to reduce the number of unique variable groups by combining datasets when they share the same variables, correct errors due to missing time points within a dataset via a weighted inference approach, and reconstruct missing variables to avoid confounding. Experimentally, we demonstrate that CMC-TS has the best F1-score and False Discovery Rate (FDR) on simulated data. We apply the approach to real data from patients with subarachnoid hemorrhage (SAH), a type of stroke, to uncover causes of low Partial Brain Tissue Oxygenation (PbtO2).
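The first step mentioned above, combining datasets that share the same variables, can be sketched as a simple grouping operation. This is only an illustration under stated assumptions and not the CMC-TS implementation: the function name `group_by_variable_set`, the patient identifiers, the variable names, and the dictionary layout are all hypothetical. The weighted inference and variable reconstruction steps are not shown.

```python
from collections import defaultdict


def group_by_variable_set(datasets):
    """Group patient IDs by the exact set of variables measured for them.

    `datasets` maps a patient ID to a dict of {variable name: time series}.
    Patients whose variable sets are identical fall into the same group and
    could be pooled before inference.
    """
    groups = defaultdict(list)
    for patient_id, series in datasets.items():
        groups[frozenset(series)].append(patient_id)
    return dict(groups)


# Hypothetical toy data: each patient has a different subset of X1..X3.
datasets = {
    "patient_1": {"X2": [0.1, 0.4], "X3": [1.2, 1.1]},
    "patient_2": {"X1": [5.0, 5.2], "X2": [0.3, 0.2]},
    "patient_3": {"X2": [0.2, 0.5], "X3": [1.0, 0.9]},
    "patient_4": {"X1": [4.8, 4.9], "X2": [0.1, 0.1], "X3": [1.3, 1.4]},
}

groups = group_by_variable_set(datasets)
for variables, patients in groups.items():
    print(sorted(variables), "->", patients)
```

In this toy example, four patients collapse into three variable groups, since patients 1 and 3 share the same measured variables. Using a `frozenset` as the key makes the grouping independent of the order in which variables were recorded.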
Statement of Significance