MmisAT successfully annotated 13 mtDNA genes, 321 nDNA genes with evidence of pathogenicity, and 1127 nDNA genes whose products are expressed and localized within mitochondria (Table 1). The corresponding transcript numbers were 13, 1563, and 5366, respectively (Table 1). Notably, the MmisP features were selected from the MmisAT annotation (Fig. 1).
Table 1 Range of genes and transcripts covered by MmisAT

Fig. 1 Workflow for MmisAT and MmisP. Left: MmisAT; right: MmisP
MmisAT can handle protein-coding variants from both nuclear DNA and mtDNA and generates 349 annotation types across six categories (Additional file 4: Table S1). It processes 4.78 million variants in 76 min, making it a valuable resource for clinical and research applications (Additional file 7: Figure S1). To explore the factors affecting the performance of MmisP, we generated different models and compared their performance on the testing datasets. Using Vari_Train, we applied supervised learning classification algorithms to obtain the optimal predictive model for mitochondrial diseases. Although all models performed well, with accuracy above 70% (Table 2), accuracy alone cannot reflect the generalization ability of a model given the imbalanced characteristics of human whole-exome sequencing data. We therefore adopted comprehensive evaluation metrics, namely recall, precision, F1 score, Matthews correlation coefficient (MCC), and area under the curve (AUC), to assess each algorithm. AdaBoost and Logistic Regression showed high recall, both exceeding 80%, while KNeighbors and Decision Tree performed poorly, with recall of 71.08% and 68.90%, respectively. Random Forest had the highest precision (82.75%), indicating a low probability of misclassifying benign variants as disease-causing with this algorithm. Although Logistic Regression produced the highest F1 score, it was only 0.02% higher than AdaBoost's. The MCC values for Logistic Regression, Random Forest, and SVM were all greater than 0.7, indicating that these three algorithms are comparable to REVEL [33] and M-CAP [34] at the recommended threshold of 75% (as detailed below). However, the AUC values for KNeighbors and Decision Tree were both below 0.8, which fell short of our expectations for binary classifiers.
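For illustration, this model comparison can be sketched with scikit-learn (the package used for training below); this is a minimal sketch, not the exact training code, and the feature matrices and labels (X_train, y_train, X_test, y_test) are placeholders for the Vari_Train-derived data.

```python
# Minimal sketch of the six-model comparison; data variables are placeholders.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),  # probability=True enables AUC
}

for name, model in models.items():
    model.fit(X_train, y_train)                # placeholders for Vari_Train data
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] # probability of "pathogenic"
    print(name,
          f"Acc={accuracy_score(y_test, y_pred):.3f}",
          f"Recall={recall_score(y_test, y_pred):.3f}",
          f"Precision={precision_score(y_test, y_pred):.3f}",
          f"F1={f1_score(y_test, y_pred):.3f}",
          f"MCC={matthews_corrcoef(y_test, y_pred):.3f}",
          f"AUC={roc_auc_score(y_test, y_prob):.3f}")
```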
Table 2 Performance of various algorithmic models

The features used in our model have different dimensions and types, and some may carry redundant information. However, since MmisP focuses on missense variants in mitochondrial diseases, the number of features has little impact on computational time and resource consumption. To evaluate whether features related to mitochondrial function improve model performance, we constructed six algorithm models under the same training conditions using a reduced feature subset that excluded the mitochondrial-specific annotations. With the exception of Random Forest (which splits nodes on randomly selected features, so highly linearly correlated features may produce no significant change in performance), all models declined in performance. In particular, accuracy and precision dropped by about 1%, and other metrics also changed to varying degrees (Table 3). Given its AUC advantage, greater than or equal to 0.9 on both feature sets (0.904 and 0.900), we ultimately chose Logistic Regression to build MmisP. We also used the learning_curve function in the scikit-learn package to evaluate the relationship between MmisP's performance and the size of the training set. As shown in Additional file 8: Figure S2, when the training set is small, the training error is low and the cross-validation error is high. As the training set grows, the model generalizes better and both errors stabilize. MmisP neither underfits nor overfits and therefore would not benefit from more training data. In the external tenfold cross-validation loop, the accuracy of MmisP was not only high but also very stable (0.873, 0.827, 0.850, 0.759, 0.770, 0.821, 0.829, 0.790, 0.834, 0.829; mean accuracy: 0.819 ± 0.033), which also shows that the model generalizes well. In addition, we tested the two Logistic Regression models on Vari_TestUnbalance to examine more closely how the disease-specific features enhance the model.
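The learning-curve analysis can be sketched as below; X and y stand in for the Vari_Train feature matrix and labels, and the cross-validation settings are illustrative assumptions.

```python
# Minimal sketch of the learning-curve analysis; X and y are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10, scoring="accuracy")

# Converging, stable training and cross-validation scores indicate that the
# model neither underfits nor overfits as the training set grows.
print(train_scores.mean(axis=1))
print(valid_scores.mean(axis=1))
```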
Table 3 Performance of various algorithm models under feature subsets (excluding mitochondrial-specific features)

The results demonstrated the performance of the predictor, evaluated using the confusion matrices shown in Fig. 2. In clinical settings, accurately identifying all pathogenic variants is crucial, and the full-feature model achieved a higher true positive (TP) count (269 vs. 267). Likewise, it is important to avoid misclassifying true pathogenic variants as benign, and the full-feature model yielded a lower false negative (FN) count (12 vs. 14). To understand the relative importance of features in the prediction model, we examined the parameters ωi of the Logistic Regression model. Since each feature corresponds to a model parameter ωi, the absolute value of ωi indicates how strongly that feature affects the predicted result. Notably, SIFT_score had the greatest impact, with a ωi value of 0.89, consistent with our expectations (as shown in Fig. 3). Other important features included protein site conservation, population frequency, and protein tissue-expression information. In addition, the network centrality measures Closeness_centrality and Eigenvector_centrality ranked second (ωi = 0.63) and twelfth (ωi = 0.21), respectively, highlighting the potential benefit of considering all mitochondrial proteins as an interacting network. Moreover, the peak logarithm of tissue expression (MitoCarta 3.0) ranked fourteenth (ωi = 0.20), indicating the potential of genotype–tissue expression data to improve variant classification accuracy.
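The feature-ranking step reduces to sorting the absolute coefficients of the fitted model; in this minimal sketch, model and feature_names are assumed to come from the training step above.

```python
# Rank features by |omega_i| from a fitted binary LogisticRegression.
import numpy as np

weights = np.abs(model.coef_[0])      # one coefficient per feature
order = np.argsort(weights)[::-1]     # descending importance
for i in order[:15]:                  # top-ranked features
    print(f"{feature_names[i]}: |omega| = {weights[i]:.2f}")
```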
Fig. 2 Confusion matrices of the Logistic Regression model under two feature backgrounds. A Confusion matrix for the complete set of 115 features: true positives: 269, true negatives: 488, false positives: 189, false negatives: 12. B Confusion matrix without mitochondria-specific features: true positives: 267, true negatives: 498, false positives: 179, false negatives: 14
Fig. 3 Importance of each feature in the Logistic Regression model. Blue: common features; green: mitochondria-specific features
MmisP outperforms genome-wide pathogenicity predictors based on overall classification performance measures

To evaluate the performance of MmisP on variants of undetermined significance (VUS), we compared it to several other genome-wide variant pathogenicity predictors, including M-CAP, REVEL, CADD [35], Eigen [36], and PrimateAI [37], which are renowned for their performance in predicting the pathogenicity of missense variants. We evaluated them on the Vari_TestUnbalance dataset, which also highlights how missing prediction scores limit the coverage, and thus the utility, of a pathogenicity predictor. While MmisP and DANN [38] scored all test variants, M-CAP failed to score nearly 40% of them, and MutationAssessor (8.35%) [39], Polyphen2-HDIV (4.07%) [15], Polyphen2-HVAR (4.07%) [15], Eigen (4.27%), and PrimateAI (4.27%) also had missing scores. Evaluating all predictors comprehensively, we found that their precision (PPV) ranged from 29.47% to 81.74%, and only PrimateAI exceeded 80% (Table 4). The negative predictive value (NPV) ranged from 78.18% to 100%, with NPVs above 90% for all predictors except M-CAP and PrimateAI. Specificity ranged from 1.03% to 96.76%, and recall from 34.94% to 100%.
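Score coverage and the confusion-matrix metrics can be computed as in the following sketch; the DataFrame df, its per-tool score columns, and the binary label column are hypothetical, and each tool would be evaluated at its own published threshold.

```python
# Minimal sketch of coverage and PPV/NPV computation; df is hypothetical.
import pandas as pd

def coverage(scores: pd.Series) -> float:
    """Fraction of test variants the tool actually scored."""
    return scores.notna().mean()

def ppv_npv(scores: pd.Series, labels: pd.Series, threshold: float):
    mask = scores.notna()                         # evaluate only covered variants
    pred = (scores[mask] >= threshold).astype(int)
    true = labels[mask]                           # 1 = pathogenic, 0 = benign
    tp = int(((pred == 1) & (true == 1)).sum())
    fp = int(((pred == 1) & (true == 0)).sum())
    tn = int(((pred == 0) & (true == 0)).sum())
    fn = int(((pred == 0) & (true == 1)).sum())
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv
```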
Table 4 Performance of MmisP and other genome-wide tools

Compared with recall, the lower specificity of some pathogenicity predictors suggests that some benign variants may be incorrectly classified as disease-causing. It is therefore necessary to establish a more stringent threshold for all pathogenicity predictors. Since the Vari_TestUnbalance dataset is imbalanced, with benign variants outnumbering disease-causing variants, the precision–recall curve (PRC) is a better indicator of predictor performance (Fig. 4A). Because the PRC is sensitive to class composition, it lets us observe how changes in sample composition affect predictor performance. Among the evaluated predictors, DANN, MutationTaster [40], and fathmm-MKL [41] had an average precision below 0.7, whereas MmisP was among the top performers with a score of 0.87. The feasibility of MmisP in practical applications was thus demonstrated on imbalanced data of nuclear gene variants associated with mitochondrial diseases. In contrast, the ROC curve is insensitive to class imbalance, so we also plotted ROC curves and calculated the area under the curve (AUC). The AUC values of MetaLR (0.901) [42], MetaSVM (0.904) [42], and MmisP (0.938) were all greater than 0.9, indicating their advantage over the other predictors (Fig. 4B). For Vari_TestUnbalance, the best classification threshold was 0.624, at which MmisP performed best. As Vari_TestBalance is well balanced, the area under the PRC increased for every predictor (Fig. 4C), and MmisP still had the largest area. The ROC curves did not change significantly (Fig. 4D), with a best classification threshold of 0.523. In summary, MmisP is suitable for application across different variant backgrounds.
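The PR/ROC evaluation can be sketched as below; y_true and scores denote the test labels and MmisP scores, and choosing the "best" threshold via Youden's J statistic is an assumption, as the exact criterion is not stated here.

```python
# Minimal sketch of the P-R/ROC evaluation and threshold selection.
import numpy as np
from sklearn.metrics import (precision_recall_curve, average_precision_score,
                             roc_curve, roc_auc_score)

precision, recall, _ = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)   # summarizes the P-R curve

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
best = thresholds[np.argmax(tpr - fpr)]        # Youden's J = TPR - FPR (assumed)
print(f"AP={ap:.3f} AUC={auc:.3f} best threshold={best:.3f}")
```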
Fig. 4 P-R curves and ROC curves of MmisP and other genome-wide pathogenicity predictors under two testing sets. A P-R curve under Vari_TestUnbalance. B ROC curve under Vari_TestUnbalance; the black dot marks the optimal MmisP threshold (0.624) under these circumstances. C P-R curve under Vari_TestBalance. D ROC curve under Vari_TestBalance; the black dot marks the optimal MmisP threshold (0.523) under these circumstances
The distribution of prediction scores for disease-causing and benign variants

To gain deeper insight into the classification behavior of MmisP and other predictors, we calculated the prediction score for each variant in the Vari_TestBalance dataset and classified variants against a set threshold (0.5 for MmisP). We visualized the score distributions for pathogenic and benign variants using violin plots (Fig. 5). Notably, MutationTaster performed worst, with almost all variants receiving prediction scores above the threshold. Eigen, DANN, and fathmm-MKL also performed poorly on benign variants, with roughly half or more falsely classified as disease-causing (Eigen 40.2%, DANN 40.2%, and fathmm-MKL 69.1%). Although M-CAP achieved high sensitivity (the ability to correctly classify pathogenic variants) at its threshold, it sacrificed specificity, with approximately 57.4% of benign variants classified as disease-causing. This indicates that an excessive pursuit of high sensitivity may reduce the resolution of exome variant analysis, increase the number of suspicious disease variants, and make it difficult to identify the one or two 'causative' variants, hindering the diagnosis of genetic diseases. In contrast, MmisP, as a disease-specific tool, misclassified only 18.7% of benign variants and 9.6% of disease-causing variants. Polyphen2-HVAR also performed well, with misclassification rates of only 26.2% for benign variants and 15.9% for disease-causing variants.
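This visualization can be sketched as below, assuming a hypothetical long-format DataFrame df with tool, score, and label columns built from Vari_TestBalance; note that in Fig. 5 each tool has its own threshold line, whereas this sketch draws only MmisP's.

```python
# Minimal sketch of the violin plot of prediction scores; df is hypothetical.
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.violinplot(data=df, x="tool", y="score", hue="label", split=True)
ax.axhline(0.5, color="red", linestyle="--")  # MmisP threshold; others differ
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```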
Fig. 5 Distribution of prediction scores from MmisP and other genome-wide pathogenicity predictors (based on Vari_TestBalance: 256 benign, 239 pathogenic). The red line marks the threshold for each tool
Performance of MmisP under the ACMG/AMP variant interpretation guidelines

MmisP performed remarkably well on Vari_Test4Gene, which covers a limited range of genes. MmisP's accuracy, F1 score, recall, and MCC were all higher than those of the other three tools. PrimateAI's precision (100.00%) was perfect, but its recall (45.45%) was the lowest. MmisP's AUC (0.936) was second only to REVEL's (0.961), which also illustrates its usefulness as a disease-specific pathogenicity predictor (Table 5). We then assessed the performance of MmisP using the defined thresholds (pathogenic variant: Pr > 0.75; benign variant: Pr < 0.15; variant of unknown significance (VUS): 0.15 ≤ Pr ≤ 0.75) [43] and the PP3/BP4 evidence unique to the Mitochondrial Disease Variant Interpretation Guidelines. We tested MmisP against the guideline-recommended tools REVEL and M-CAP at low classification thresholds using the Vari_TestThreshold dataset (Table 6). MmisP had a recall of 97.81%, second only to REVEL (98.39%) but significantly better than M-CAP (41.67%, p < 0.001). Overall, MmisP correctly classified 56.74% of missense variants, slightly higher than REVEL (52.54%) and M-CAP (52.19%). Additionally, MmisP kept VUS calls low, with only 38.88% of predicted scores falling into that range, lower than REVEL (45%). M-CAP produced even fewer VUS calls (34.5%), but at the cost of low sensitivity it classified only a small number of variants as disease-causing. In conclusion, at these extreme thresholds MmisP classified 61.12% of missense variants as disease-causing or benign, with 92.84% of those classifications being correct.
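At these thresholds, the classification reduces to a simple three-way rule, sketched here for a single MmisP prediction probability:

```python
# Three-way classification at the thresholds cited above.
def classify(score: float) -> str:
    if score > 0.75:
        return "pathogenic"   # Pr > 0.75
    if score < 0.15:
        return "benign"       # Pr < 0.15
    return "VUS"              # 0.15 <= Pr <= 0.75
```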
Table 5 Comparison of MmisP and other predictors in four widely studied genes (POLG, SLC19A3, PDHA1, ETHE1)

Table 6 Performance of MmisP and other tools under recommended thresholds

Performance on simulated disease exomes

In the analysis of Mendelian disease exomes, the major challenge is to identify the one or two 'causative' disease-causing variants among hundreds of predicted disease-causing variants, even after applying a standard allele frequency filter to remove common benign variants (MAF > 1%). Given the large number of predicted disease-causing variants, it can be difficult to pinpoint the few that are truly responsible for the disease, especially when limited time and cost make experimental validation of many candidate variants infeasible. To address this issue, we randomly selected background exomes from 170 and 29 healthy individuals from the 1000 Genomes Project to form two groups and introduced a 'causative' disease-causing variant into each background exome to simulate Mendelian disease exomes, naming the two sets Simulated_Exome170 and Simulated_Exome29, respectively (as described in the Supplementary Methods).
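The spiking step of this simulation can be sketched as below; the variant records background_exome and causative_variants are hypothetical, and the full pipeline, including the allele-frequency filter, is described in the Supplementary Methods.

```python
# Minimal sketch: spike one known disease-causing variant into a healthy
# background exome to create a simulated Mendelian disease exome.
import random

def simulate_exome(background_exome, causative_variants, seed=None):
    rng = random.Random(seed)
    spiked = list(background_exome)            # copy the background variants
    spiked.append(rng.choice(causative_variants))
    return spiked
```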
This analysis evaluated how well different pathogenicity predictors identify disease-causing variants in the simulated exomes. To compare the lengths of the candidate lists produced by different predictors, we first calculated the percentage of variants each predictor called disease-causing at its recommended threshold. MetaLR generated the smallest candidate list, predicting only 0.475 ± 0.377% of Simulated_Exome29 variants as disease-causing, while MmisP predicted 38.419 ± 1.244% (Fig. 6A and Additional file 5: Table S2). Simulated_Exome170 showed a similar trend: MetaLR predicted 0.644 ± 0.351% of variants as disease-causing, while MmisP again produced the longest list at 38.567 ± 1.366% (Fig. 6B and Additional file 6: Table S3). All tools showed a similar trend in the percentage of disease-causing variants predicted across both simulated disease exomes. Next, we evaluated the ability of the pathogenicity predictors to rank the 'causative' disease-causing variants among the top-scoring ones. After sorting the scores for each predictor, we calculated the average rank of the disease-causing variants introduced into the simulated exomes (Fig. 6C and Additional file 6: Table S3). In Simulated_Exome29, MmisP performed well with an average rank of 39.655 ± 55.478 (median rank: 18), only slightly worse than the best-performing tool, MetaLR, with an average rank of 15.414 ± 8.604 (median rank: 12); the difference was not significant (Mann–Whitney p = 0.703). In Simulated_Exome170, MmisP and MetaLR both showed excellent performance (Mann–Whitney p = 0.027), with average ranks of 12.429 ± 21.382 and 12.162 ± 5.880 (median ranks: 3.5 and 10), respectively (Fig. 6D and Additional file 6: Table S3). Overall, the average rank of the 'causative' disease-causing variants differed significantly between the two simulated exomes (Additional file 6: Table S3).
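The rank evaluation can be sketched as follows; the score dictionary, the causative identifier, and the rank lists passed to the Mann–Whitney test are hypothetical illustrations.

```python
# Minimal sketch of the rank evaluation for one simulated exome.
from scipy.stats import mannwhitneyu

def causative_rank(scores: dict, causative) -> int:
    """1-based rank of the spiked variant when sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(causative) + 1

# Hypothetical per-exome rank lists for two tools:
ranks_mmisp = [18, 12, 40, 7, 3]
ranks_metalr = [12, 15, 10, 20, 9]
stat, p = mannwhitneyu(ranks_mmisp, ranks_metalr)
print(f"Mann-Whitney U={stat}, p={p:.3f}")
```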
Fig. 6 Evaluation of the different pathogenicity predictors using two simulated exomes. A Distribution of the percentage of predicted disease-causing variants in Simulated_Exome29. B Distribution of the percentage of predicted disease-causing variants in Simulated_Exome170. C Ranking of the "causative" disease-causing variants introduced into Simulated_Exome29. D Ranking of the "causative" disease-causing variants introduced into Simulated_Exome170