A time-sequenced approach to machine learning prognostic modelling with implementation on running-related injury prediction

Abstract

Background The use of machine learning (ML) methods in medical prognostic modelling is gaining popularity, yet all currently available source models were designed from a general mathematical perspective. These models are limited in their ability to embed discipline-specific information, which restricts their practical interpretability.

Objective This study introduces two novel prognostic ML source models designed with a discipline-specific approach, tests their performance against commonly used ML methods, and explores their interpretability.

Methods Multidisciplinary risk factor measurements (genetics, history, neuromuscular capacity, biomechanics, body composition, nutrition, training) were taken from competitive endurance runners, who were subsequently monitored weekly over 12 months for running-related injuries (RRIs). The data were fitted with commonly used ML methods and the novel models within a stratified 10-fold cross-validation framework for performance comparison. Interpretable feature interactions were tested for statistical significance, and extracted feature importance scores were tested for correlation with SHapley Additive exPlanations (SHAP) values.
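As an illustration of the stratified 10-fold cross-validation step, the following is a minimal sketch, not the authors' implementation. The function name `stratified_k_fold` and the toy labels are hypothetical, and the sketch assumes each weekly sample is treated as an independent observation, which the abstract does not state.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Return k (train_idx, test_idx) pairs with class ratios roughly preserved.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into k folds
    # so every fold receives a near-equal share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    splits = []
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        splits.append((train_idx, test_idx))
    return splits


# Made-up, imbalanced toy labels: 1 = injured week, 0 = uninjured week.
labels = [1] * 20 + [0] * 180
splits = stratified_k_fold(labels, k=10)
```

Stratifying matters when one class is rare: a plain random split could leave some test folds with very few positive samples, which makes per-fold sensitivity estimates unstable.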

Results A total of 6,181 valid weekly samples were collected from 142 competitive endurance runners. The performance of the novel methods (AUC 0.736-0.753, Accuracy 0.822-0.849, Sensitivity 0.376-0.455, Specificity 0.859-0.896) was comparable to that of commonly used ML methods (AUC 0.649-0.784, Accuracy 0.662-0.857, Sensitivity 0.337-0.568, Specificity 0.671-0.904). Pairwise feature interactions revealed stable patterns (p<0.001). The method-specific, computationally efficient feature importance scores correlated moderately with SHAP values (r=0.12-0.72), and the correlation increased as model parameterization increased.
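For readers less familiar with the reported quantities, the sketch below shows how accuracy, sensitivity, specificity, and a Pearson correlation coefficient can be computed. It is illustrative only: the helper names and all numbers are made up, and the abstract does not specify which correlation coefficient the authors used, so Pearson's r is an assumption here.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels (1 = injury)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumes both classes are present, so neither denominator is zero.
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of injured weeks detected
        "specificity": tn / (tn + fp),  # share of uninjured weeks cleared
    }


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Made-up predictions for ten weekly samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)

# Made-up per-feature scores: model-specific importance vs. mean |SHAP| value.
importance = [0.10, 0.20, 0.30, 0.40]
mean_abs_shap = [0.20, 0.40, 0.60, 0.80]
r = pearson_r(importance, mean_abs_shap)
```

With imbalanced outcomes such as weekly injury status, accuracy and specificity can be high while sensitivity stays low, which is why the three are reported separately.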

Conclusion The novel methods showed performance comparable to, and interpretability better than, that of common ML methods. Interpretability improved with increasing parameterization, suggesting performance may further improve with larger datasets and more features. Future research should validate these methods more rigorously on larger datasets before they are widely adopted in prognostic modelling research.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This research was supported by the Alan Turing Institute Enrichment Scheme and the China Scholarship Council.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The National Health Service (NHS) Research Ethics Committee (South West - Central Bristol) and the Loughborough University Ethics Sub-Committee gave ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data and Code Availability

Processed data are shared in the supplementary materials. Raw data are available on request; however, some metrics (e.g. past performance records) must remain normalized to ensure participant anonymity. Code is available on GitHub: henrywu0709/TSNN-TSGNN-for-prognostic-modelling.
