Predicting the Risk of Lumbar Prolapsed Disc: A Gene Signature-Based Machine Learning Analysis

Data Source

Transcriptomic data from the peripheral blood of patients with LPD were obtained from the publicly available Gene Expression Omnibus (GEO) database (GSE150408) to train the models [18]. LPD falls under ICD-10-CM code M51 [19]. The dataset's inclusion criteria were patients with sciatica and LPD confirmed by magnetic resonance imaging (MRI) at either the L4/5 or L5/S1 level. The dataset excluded individuals with other neuropathies, spinal diseases, infections, rheumatic conditions, cardiovascular or metabolic diseases, dementia, mental health disorders, a history of surgery, congenital conditions, tuberculosis, or tumours [18]. The training dataset included 17 patients and 17 healthy volunteers. To evaluate the performance of the constructed models, we used an independent testing dataset, GSE124272, consisting of eight patients and eight healthy controls [20].

Ethical Approval

This research utilized existing studies and did not involve any new experiments with human participants or animals conducted by the authors. For model training, we obtained transcriptomic data from patients with LPD using the GEO database entry GSE150408, and for model testing, we used data from GSE124272. All procedures in these referenced studies involving human participants were approved by the Ethics Committee of the Sichuan Provincial Orthopedic Hospital.

Pain-Related Genes

In developing the models, we focused on 23 genes that have been previously reported to influence the risk of chronic back pain and widespread pain syndromes (Table S1) [21]. These 23 genetic factors are involved in intervertebral disc stability, inflammation, and pain signalling, which are important in LPD and pain [22,23,24]. The 23 pain-related genes are grouped into three categories as follows [21]: (i) Genes affecting intervertebral disc stability: COL9A2 (collagen type IX alpha 2 chain), COL9A3 (collagen type IX alpha 3 chain), COL11A1 (collagen type XI alpha 1 chain), COL11A2 (collagen type XI alpha 2 chain), COL1A1 (collagen type I alpha 1 chain), ACAN (aggrecan), CILP (cartilage intermediate layer protein), VDR (vitamin D receptor), MMP3 (matrix metallopeptidase 3), MMP9 (matrix metallopeptidase 9), THBS2 (thrombospondin 2); (ii) Genes influencing inflammation: IL1RN (interleukin-1 receptor antagonist), IL1A (interleukin-1 alpha), IL1B (interleukin-1 beta), IL6 (interleukin-6); and (iii) Genes involved in pain signalling: GCH1 (GTP cyclohydrolase 1), COMT (catechol-O-methyltransferase), OPRM1 (opioid receptor mu 1), OPRD1 (opioid receptor delta 1), MC1R (melanocortin 1 receptor), TRPV1 (transient receptor potential cation channel subfamily V member 1), TRPA1 (transient receptor potential cation channel subfamily A member 1), FAAH (fatty acid amide hydrolase).

Feature Selection

Because of the limited number of samples and the high dimensionality of the features, we first generated 10 sets of simulated gene expression data by adding Gaussian noise to the original training and testing datasets. This approach helps stabilize parameter estimation and reduces the risk of overfitting [25]. The Gaussian noise was added so that the simulated data mimic the inherent variability of the original data while preserving the overall structure and the relationships between variables.
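The augmentation step can be illustrated with the short Python sketch below. This is not the original analysis code (which was written in R), and the noise standard deviation `noise_sd` is a placeholder, since the exact noise parameters are not reported here:

```python
import random

def augment_with_noise(expression, n_sets=10, noise_sd=0.1, seed=42):
    """Generate `n_sets` simulated copies of an expression matrix by
    adding zero-mean Gaussian noise to every value. `expression` is a
    list of samples, each a list of gene-expression values."""
    rng = random.Random(seed)
    simulated = []
    for _ in range(n_sets):
        noisy = [[value + rng.gauss(0.0, noise_sd) for value in sample]
                 for sample in expression]
        simulated.append(noisy)
    return simulated

# Toy matrix: 2 samples x 3 genes
original = [[1.2, 0.8, 2.5], [0.9, 1.1, 2.2]]
simulated_sets = augment_with_noise(original, n_sets=10)
```

Because the noise has zero mean and a small standard deviation relative to the expression values, each simulated set perturbs individual measurements while leaving gene-to-gene relationships largely intact.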

Both the original and simulated datasets were used to train and test the models. To identify the most relevant features for our predictive model, we employed recursive feature elimination (RFE) with leave-one-out cross-validation (LOOCV) as the resampling method [26, 27]. RFE is a feature selection technique that recursively removes features from the dataset and evaluates the model's performance to identify the most important ones. The process began by fitting the model with all features in the dataset. The least important feature, as determined by the model's feature importance scores, was then removed. The model was refitted using the remaining features, and the process was repeated until a specified number of features remained. In this study, RFE with LOOCV was used to estimate the model's performance for each feature subset. LOOCV is a cross-validation technique in which a single observation serves as the validation set while the remaining observations form the training set; the process is repeated for each observation, yielding a performance estimate that is well suited to small datasets such as ours [28]. The optimal feature set was selected on the basis of the highest cross-validated performance. Importance scores were then extracted from the RFE process using the varImp function, which ranks each feature according to the model's internal criteria. This approach allowed us to identify and rank the genes contributing most to the predictive power of the model.
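The RFE-with-LOOCV loop can be sketched as follows. This is a simplified Python illustration, not the caret implementation used in the study: it substitutes a 1-nearest-neighbour classifier for the LOOCV scoring step and the absolute difference of class means as a stand-in importance measure:

```python
import math

def loocv_accuracy(X, y, features):
    """LOOCV accuracy of a 1-nearest-neighbour classifier restricted
    to the given feature indices (toy stand-in for the real model)."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = float("inf"), None
        for j in range(len(X)):
            if i == j:
                continue  # hold sample i out as the validation set
            d = math.dist([X[i][f] for f in features],
                          [X[j][f] for f in features])
            if d < best_dist:
                best_dist, best_label = d, y[j]
        correct += (best_label == y[i])
    return correct / len(X)

def rfe_loocv(X, y, min_features=1):
    """Recursive feature elimination: repeatedly drop the least
    important feature, score each subset with LOOCV, and return the
    best (score, features) pair."""
    features = list(range(len(X[0])))
    best = (loocv_accuracy(X, y, features), list(features))
    while len(features) > min_features:
        def importance(f):
            # Stand-in importance: separation of class means on feature f
            m1 = sum(X[i][f] for i in range(len(X)) if y[i] == 1) / sum(y)
            m0 = sum(X[i][f] for i in range(len(X)) if y[i] == 0) / (len(y) - sum(y))
            return abs(m1 - m0)
        features.remove(min(features, key=importance))
        score = loocv_accuracy(X, y, features)
        if score >= best[0]:
            best = (score, list(features))
    return best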

Model Construction

In this study, multiple machine learning models were constructed on the basis of the selected gene signatures. To improve prediction performance, hyperparameter tuning was conducted using a grid search approach with fivefold cross-validation. For the support vector machine (SVM) algorithm with a radial basis function (RBF) kernel, the grid search covered 2^−5 to 2^15 for the cost parameter C and 2^−15 to 2^3 for sigma to determine the optimal parameters. The random forest model was tuned over the number of variables randomly sampled as candidates at each split (mtry, 1 to 8), the number of trees in the forest (ntree, 100 to 1000), and the minimum size of terminal nodes (nodesize, 1 to 20). For the k-nearest neighbours (KNN) model, values of k from 1 to 100 were explored to identify the optimal number of neighbours. The "optimal" kernel was selected to improve the weighting of the neighbours' contributions, ensuring that closer neighbours have a greater influence on the classification decision; the distance metric was set to 2, corresponding to the Euclidean distance. In the logistic regression model, the family parameter was set to binomial, as the target variable was binary. In the XGBoost model, the learning rate (eta) was tested at 0.01, 0.1, and 0.3 to control the boosting step size; the maximum tree depth (max_depth) at 3, 6, and 9; and the minimum sum of instance weights in a child node (min_child_weight) at 1, 3, and 5. The subsample ratios of training data (subsample) and features (colsample_bytree) were tested at 0.5, 0.8, and 1 to prevent overfitting, and the minimum loss reduction (gamma) was tested at 0, 1, and 3.
The training process involved 1000 boosting iterations with early stopping after 10 rounds to monitor and prevent overfitting.
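All of the grid searches described above share one pattern: enumerate every parameter combination and keep the one with the best cross-validated score. A minimal Python sketch of that pattern (the study itself used caret in R; the lambda below is a toy stand-in for a fivefold-CV scoring routine):

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustive grid search: try every combination in `param_grid`
    (dict of parameter name -> list of candidate values) and return
    the combination with the highest score from `evaluate`."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# XGBoost-style grid using values from the text; the scorer is a toy
# stand-in, not a real cross-validation.
grid = {"eta": [0.01, 0.1, 0.3], "max_depth": [3, 6, 9], "gamma": [0, 1, 3]}
params, score = grid_search(
    grid, lambda p: -abs(p["eta"] - 0.1) - abs(p["max_depth"] - 6))
```

In the actual analysis, `evaluate` would refit the model with each candidate combination and return its fivefold cross-validated accuracy.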

Performance Evaluation

Each machine learning model’s performance was assessed on the basis of four performance metrics: accuracy, F1 score, Matthews correlation coefficient (MCC), and area under the curve (AUC). The true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values were used to calculate these metrics.

Accuracy measures the proportion of correct predictions (true positives and true negatives) out of the total number of predictions. It is a widely used metric for classification problems but can be misleading in the case of imbalanced datasets.
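In terms of the confusion-matrix counts, accuracy can be written as:

$$\text{Accuracy}=(\text{TP}+\text{TN})/(\text{TP}+\text{TN}+\text{FP}+\text{FN})$$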

F1 score is the harmonic mean of precision and recall, which combines both false positives and false negatives into a single metric. It is particularly useful when dealing with imbalanced datasets where the positive class is rare.

$$\text{F1}=2\times (\text{Precision}\times \text{Recall})/(\text{Precision}+\text{Recall})$$

MCC is a metric that takes into account all four values of the confusion matrix (true positives, true negatives, false positives, and false negatives) and provides a balanced measure of classification performance. It ranges from −1 to 1, with 1 indicating perfect classification, 0 indicating random classification, and −1 indicating complete misclassification.

$$\text{MCC} = \left( \text{TP} \times \text{TN} - \text{FP} \times \text{FN} \right)/\sqrt{\left( \text{TP} + \text{FP} \right) \times \left( \text{TP} + \text{FN} \right) \times \left( \text{TN} + \text{FP} \right) \times \left( \text{TN} + \text{FN} \right)}$$

The AUC was also calculated to evaluate the classification models. The receiver operating characteristic (ROC) curve was generated by plotting the true positive rate against the false positive rate, representing the trade-off between sensitivity and specificity. The AUC provides a single scalar value summarizing the overall performance of the model, ranging from 0 to 1, with higher values indicating better performance and 1 being the best possible score.
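The four metrics can be computed directly from the confusion-matrix counts and predicted scores. The Python sketch below is for illustration only (the analysis itself was run in R); it computes AUC via the rank statistic (the probability that a random positive is scored above a random negative), which equals the area under the ROC curve:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, F1, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f1, mcc

def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive is
    scored above a randomly chosen negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```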

Statistical Analysis

All statistical analyses were performed using R version 4.4.0 (2024; https://www.r-project.org). Feature selection with the RFE method, hyperparameter tuning, and cross-validation were carried out using the caret package [29]. The e1071 package was used to train the SVM model [30], the randomForest package for the random forest model [31], and the class package for the KNN model [32]. The glm function was used to train the logistic regression model [33], and the XGBoost model was trained using the xgboost package [34]. The t test or Mann–Whitney U test, as appropriate, was used to compare the expression levels of the selected gene signatures between the two groups, with significance defined as a p value less than 0.05. Benjamini–Hochberg false discovery rates (FDRs) were applied to correct for multiple hypothesis testing [35].
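For reference, the Benjamini–Hochberg adjustment is straightforward to compute; the Python sketch below is intended to mirror R's p.adjust(method = "BH"):

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values: scale the i-th smallest
    p-value by m/i, then enforce monotonicity from the largest rank
    down so adjusted values never exceed those of higher ranks."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A gene is then called differentially expressed when its adjusted p value (FDR) falls below the chosen threshold.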
