Evaluation of Machine Learning Models for Early Prediction of Gestational Diabetes Using Retrospective Electronic Health Records from Current and Previous Pregnancies

ABSTRACT

Objective To assess the performance of machine learning (ML) models in predicting gestational diabetes mellitus (GDM) using electronic health record (EHR) data from the first antenatal visit, and to determine whether incorporating data from previous pregnancies improves performance.

Methods and Analysis In this retrospective cohort study, several ML models were developed to predict GDM using EHR data (n=27,561; GDM 11.6%) from nulliparous and multiparous populations. Past pregnancy data (n=4,005) were incorporated to improve future (preconception) GDM predictions. Four ML algorithms were evaluated: logistic regression (LR), random forest (RF), XGBoost (XGB), and explainable boosting machine (EBM). Model performance was evaluated on an internal validation set in terms of discrimination (AUROC) and calibration (plots, slope and intercept).
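As an illustration of the two evaluation criteria named above, the following is a minimal sketch, not the authors' code. It assumes the calibration slope and intercept come from a logistic recalibration fit of the observed outcome on the logit of the predicted probability, which is one common convention; the abstract does not state which definition was used. The function names `auroc` and `calibration_slope_intercept` are hypothetical.

```python
import math


def auroc(y_true, y_prob):
    """Rank-based AUROC (Mann-Whitney U); tied scores share an average rank."""
    pairs = sorted(zip(y_prob, y_true))
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one case and one non-case")
    rank_sum_pos, i = 0.0, 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(y for _, y in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def _logit(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))


def _sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibration_slope_intercept(y_true, y_prob, max_iter=100, tol=1e-10):
    """Fit logit(P(y=1)) = a + b * logit(p_hat) by Newton-Raphson.

    Returns (slope b, intercept a). Assumption: both are taken from the
    same joint fit; other definitions of the intercept exist.
    """
    x = [_logit(p) for p in y_prob]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(max_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y_true):
            mu = _sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g_a += yi - mu
            g_b += (yi - mu) * xi
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        if abs(det) < 1e-12:
            raise ValueError("degenerate fit: predictions lack variation")
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a, b = a + da, b + db
        if max(abs(da), abs(db)) < tol:
            break
    return b, a


# Toy data (made up): predictions of 0.1 and 0.9 where the observed
# event rates are 0.2 and 0.8, i.e. the predictions are too extreme.
y = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
p = [0.1] * 10 + [0.9] * 10
slope, intercept = calibration_slope_intercept(y, p)
```

On this toy data the slope comes out at about 0.63 and the intercept at approximately zero. Under this convention, a slope below 1 indicates predictions that are too extreme, a slope above 1 indicates predictions that are too conservative, and a slope of 1 with an intercept of 0 corresponds to ideal calibration.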

Results The Feature Agnostic Model (all features) achieved an AUROC of 0.832 (slope 0.967; intercept −0.088) with LR, similar to more complex models such as XGB (AUROC 0.828; slope 0.976; intercept −0.072) and EBM (AUROC 0.829; slope 0.939; intercept −0.131). The Sequential Model, which included first trimester and previous pregnancy data, demonstrated the highest predictive performance, with XGB achieving an AUROC of 0.904 (slope 0.618; intercept −0.136). Subset models using top clinical features maintained strong performance, particularly the Sequential Model, which achieved an AUROC of 0.897 (slope 1.137; intercept 0.161) using only eight features.

Conclusion Incorporating previous pregnancy data improved ML performance for GDM prediction. Further, using a subset of clinically relevant features yielded similar performance, supporting potential integration into clinical decision support systems. These findings highlight the promise of early GDM risk identification in both nulliparous and multiparous populations; however, additional research, including external validation and clinical trials, is needed to determine the models’ practical utility and effect on maternal and neonatal outcomes.

KEY MESSAGES

What is already known on this topic?

Machine learning (ML) approaches have been used to identify gestational diabetes mellitus (GDM) risk in early pregnancy, but most studies focus only on data from the current pregnancy and do not leverage prior obstetric history.

What this study adds

Incorporating data from previous pregnancies substantially improved the predictive performance of ML models. Reducing the models to a small set of key features still yielded good performance, as assessed by model discrimination and calibration, suggesting feasibility for real-time clinical use at or even before the first antenatal visit.

How this study might affect research, practice or policy

These findings support the possibility of earlier GDM detection and intervention, and highlight the need for external validation and prospective trials to confirm broader utility, inform clinical workflows, and guide policy on integrating ML-driven risk prediction into routine maternity care.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work has emanated from research supported in part by a grant from Research Ireland under Grant Number 18/CRT/6183.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The Research Ethics Committee of the Coombe Hospital, Dublin, Ireland, gave ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

Due to patient confidentiality and data use agreements, individual-level data cannot be shared publicly.

Abbreviations

AP: Average Precision
AUROC: area under the receiver operating characteristic curve
BMI: Body mass index
CDSS: clinical decision support systems
EHR: Electronic health record
EBM: Explainable Boosting Machine
FAM: Feature Agnostic Model
FPG: fasting plasma glucose
GDM: Gestational diabetes mellitus
HbA1c: glycosylated haemoglobin
IADPSG: International Association of Diabetes and Pregnancy Study Groups
IDs: patient identifiers
LR: Logistic Regression
ML: Machine learning
NPM: Nulliparous Model
OGTT: oral glucose tolerance test
RF: Random Forest
SPM: Sequential Pregnancy Model
SHAP: SHapley Additive exPlanations
TG: triglycerides
XGB: eXtreme Gradient Boosting

Medrxiv - Health Informatics
