Background: Interstitial lung disease (ILD) is the leading cause of death in patients with systemic sclerosis (SSc), affecting more than 40% of this population. Despite the availability of effective treatments to stabilize or improve lung function, survival for patients with SSc-ILD remains poor. Poor outcomes have been attributed to delayed diagnosis and delayed initiation of treatment for SSc-ILD. Although recent guidelines have provided conditional recommendations for early screening, pulmonary function tests (PFTs) are insensitive for early diagnosis, and computed tomography (CT), the current gold standard, often detects disease only after irreversible lung injury has occurred. A single sensitive biomarker that can accurately predict the risk of SSc-ILD development and mortality is lacking. We hypothesized that applying machine learning (ML) methods to multiple features from readily available electronic health records (EHR) could yield a model that detects ILD and predicts mortality in patients with SSc.

Methods: We retrospectively analyzed EHR data from participants enrolled in a single-center registry of patients with SSc over a period of twenty-eight years (1995-2024). We applied a combination of ML models to seventy-four clinical features encompassing demographics, clinical history, PFTs, and laboratory results. The resultant models were tasked with detecting ILD and predicting mortality in participants with SSc.

Results: A total of 1,169 participants with SSc were included in this study, spanning 15,494 person-years of observation. Models detecting ILD achieved an AUC of 0.818 and confirmed the importance of known biomarkers, such as autoantibodies and PFTs, as risk factors for SSc-ILD. Unexpected clinical values, including white blood cell count and mean corpuscular volume, were also important for model prediction of SSc-ILD. For prediction of one-year all-cause mortality, models reached an AUC of 0.903. In a subgroup analysis of those with prevalent radiographic SSc-ILD, three-year all-cause mortality prediction reached an AUC of 0.831. These models identified features strongly associated with mortality that are routinely collected during clinical assessment of patients with SSc, including unexpected associations with values such as red cell distribution width and serum chloride concentration.

Conclusions: ML-based analysis of clinical features and laboratory tests collected as part of routine clinical care detects ILD and predicts mortality in patients with SSc.
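The AUC values reported in the Results summarize how well a model's risk scores rank participants who had the outcome above those who did not. As a minimal sketch of that metric only, and not the authors' code, the snippet below computes AUC as the fraction of positive-negative pairs ranked correctly, with ties counted as half. The function name `roc_auc` and the toy labels and scores are hypothetical and are not study data.

```python
def roc_auc(labels, scores):
    """Return the AUC: the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Tied scores count as half a correct ranking."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # tie: half credit
    return correct / (len(pos) * len(neg))


# Toy, made-up example (NOT study data): 1 = outcome occurred, 0 = did not.
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to ranking no better than chance and 1.0 to perfect ranking. The pairwise loop is quadratic in sample size, so for cohorts of realistic size a rank-based implementation from an established library would be the more practical choice.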
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This research was supported in part through a generous gift from K. Querrey and L. Simpson. This research was also supported by the computational resources and staff contributions provided for the Quest high-performance computing facility at Northwestern University, which is jointly supported by the Office of the Provost, the Office for Research, and Northwestern University Information Technology. This research was also supported in part through the computational resources and staff contributions provided by the Genomics Compute Cluster, which is jointly supported by the Feinberg School of Medicine, the Center for Genetic Medicine, Feinberg's Department of Biochemistry and Molecular Genetics, the Office of the Provost, the Office for Research, and Northwestern Information Technology. The Genomics Compute Cluster is part of Quest, Northwestern University's high-performance computing facility, with the purpose of advancing research in genomics. N.S.M. was supported by the American Heart Association (grant no. 24PRE1196998). M.H. was supported by the NIH (grant no. R01AR07327). C.A.G. was supported by the NIH (grant no. K23HL169815), a Parker B. Francis Opportunity Award, and an American Thoracic Society Unrestricted Grant. R.G.W. is supported by the NIH (grant nos. U19AI135964, R01AI158530, R01HL149883, P01HL154998, U01TR003528). G.R.S.B. was supported by a Chicago Biomedical Consortium grant, Northwestern University Dixon Translational Science Award, Simpson Querrey Lung Institute for Translational Science, the NIH (grant nos. P01AG049665, P01HL154998, U54AG079754, R01HL147575, R01HL158139, R01HL147290, R21AG075423 and U19AI135964), and the Veterans Administration (award no. I01CX001777). A.V.M. was supported by the NIH (grant nos. U19AI135964, P01AG049665, P01HL154998, U19AI181102, R01HL153312, R01HL158139, R01ES034350 and R21AG075423). A.A. was supported by the NIH (grant nos. U19AI135964 and R01HL158138) and Simpson Querrey Lung Institute for Translational Science. A.J.E. was supported by the NIH (grant no. L30HL149048). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
IRB of Northwestern University gave ethical approval for this work (STU00002669).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced and analyzed in the present work are currently not publicly available due to participant privacy. The corresponding author will consider reasonable requests for data on an individual basis. The code used for data processing and model development is available on GitHub.