Objective To develop and evaluate open-source machine learning (ML) models for predicting hospital short stays (length of stay [LOS] under 48 and 72 hours) exclusively using data available at the time of emergency department (ED) admission, with a novel application of target encoding for diagnostic codes.
Materials and Methods We trained two ML algorithms (Random Forest and XGBoost) on electronic health record (EHR) data from two hospitals to predict hospital short stays. We employed an innovative weighted target encoding method that converted categorical International Classification of Diseases, Tenth Revision (ICD-10) codes into numeric representations of their probabilistic contribution to LOS. We measured the area under the receiver operating characteristic curve (AUROC) for correctly predicting LOS under 48 or 72 hours, and compared performance against logistic regression.
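The abstract does not specify the exact weighting scheme, but a common form of target encoding blends each code's observed short-stay rate with the global rate, weighted by how often the code appears. The sketch below illustrates that idea under assumed names (`icd10`, `short_stay`) and an assumed smoothing constant; in practice the encoding would be fit on training folds only to avoid target leakage.

```python
import pandas as pd

def target_encode(codes: pd.Series, target: pd.Series,
                  smoothing: float = 10.0) -> pd.Series:
    """Smoothed target encoding: replace each ICD-10 code with a
    count-weighted blend of its category mean and the global mean.
    Rare codes are shrunk more strongly toward the global mean."""
    global_mean = target.mean()
    stats = target.groupby(codes).agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)
    encoding = weight * stats["mean"] + (1 - weight) * global_mean
    return codes.map(encoding)

# Toy example: binary target = 1 if LOS < 48 hours (illustrative data only)
df = pd.DataFrame({
    "icd10": ["J18.9", "J18.9", "I50.9", "I50.9", "I50.9", "N39.0"],
    "short_stay": [1, 1, 0, 0, 1, 1],
})
df["icd10_enc"] = target_encode(df["icd10"], df["short_stay"])
```

The encoded column is a single numeric feature that tree-based models such as Random Forest and XGBoost can split on directly, avoiding the high-cardinality one-hot expansion that thousands of distinct ICD-10 codes would otherwise require.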
Results The final sample included 8,693 adult patients admitted to an internal medicine service. Random Forest models achieved the highest performance for predicting LOS under 48 hours (AUROC=0.96, 95% CI 0.95-0.97; accuracy=91%) and under 72 hours (AUROC=0.94, 95% CI 0.93-0.95; accuracy=88%). These models outperformed logistic regression using the same features (48-hour AUROC=0.57, 95% CI 0.54-0.59 and accuracy=70%; 72-hour AUROC=0.59, 95% CI 0.57-0.61 and accuracy=56%).
Discussion Leveraging an innovative target encoding method, the Short Hospitalization Prediction (SHoP) model substantially outperforms previous ML approaches in accurately predicting LOS under both 48 and 72 hours using only ED pre-admission data (AUROC 0.94-0.96).
Conclusion The technical innovation and predictive capability of the SHoP model enables powerful, real-time applications for optimizing patient flow and hospital resource utilization by identifying potentially divertible admissions while patients are still in the ED.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: Dr. Leuchter is supported by funding from the NIH-NHLBI. Dr. Gabel is a co-founder of Extrico Health Inc., a healthcare analytics company. The funders/companies had no role in the study design or in the collection, analysis, or interpretation of data, the writing of the report, or the decision to submit the article for publication.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The UCLA IRB determined that this secondary analysis of de-identified data did not constitute human subjects research and therefore did not require review.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability: Data were never abstracted for this study; training and testing of models were done on a secure cloud-based server that prohibits the authors from extracting data for distribution. Thus, it is not technically feasible to share the training or test data. The code used to train the models is open source and freely available on GitHub, as noted in the manuscript text.