First experiences with machine learning predictions of accelerated declining eGFR slope of living kidney donors 3 years after donation

This study was conducted to test the ability of machine learning models to preoperatively predict relevant deterioration of excretory kidney function after kidney donation. Scientific research has mainly focused on predicting graft function, which has previously been attempted using machine learning [9,10,11]. Contrary to the findings of older studies, kidney donors were shown to be at increased risk of developing ESKD [3, 4]. Identifying these at-risk donors remains an unmet need in clinical practice.

We used the eGFR slope as our prediction target. As a dynamic parameter, we consider the eGFR slope a better measure for assessing donor kidney function than eGFR at a single time point during follow-up. Particularly for donors with borderline pre-donation eGFR, the extent of eGFR change over time provides a more comprehensive picture of current kidney function than any single past time point, and it reflects the approach of clinicians, who interpret eGFR in a temporal context.
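The study does not spell out how the slope was estimated; a minimal sketch, assuming a simple least-squares fit of eGFR over annual follow-up visits, might look like this (the visit times and values below are illustrative only):

```python
import numpy as np

def egfr_slope(times_years, egfr_values):
    """Least-squares slope of eGFR (mL/min/1.73 m²/year) over follow-up time."""
    t = np.asarray(times_years, dtype=float)
    y = np.asarray(egfr_values, dtype=float)
    slope, _intercept = np.polyfit(t, y, deg=1)  # degree-1 fit: [slope, intercept]
    return slope

# Hypothetical donor with four annual eGFR measurements after donation
slope = egfr_slope([0.0, 1.0, 2.0, 3.0], [62.0, 61.0, 60.0, 59.0])
print(round(slope, 2))  # -1.0
```

A regression over all follow-up values is less sensitive to measurement noise at any single visit than a two-point difference would be.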

The use of the eGFR slope as a surrogate parameter for evaluating kidney function has been discussed in previous literature [35–39]. A recently published meta-analysis reported associations between treatment effects on the GFR slope and the corresponding clinical endpoints for worsening kidney function. The authors concluded that the GFR slope serves as a good surrogate parameter for evaluating kidney function in clinical trials [38], a view also adopted by regulatory agencies such as the U.S. Food and Drug Administration (FDA) [40] and the European Medicines Agency (EMA) [41].

A normal decline in kidney function is approximately 1 mL/min/1.73 m²/year [18]. The median eGFR slope of our donor collective was −0.33 mL/min/1.73 m²/year, consistent with previous findings reporting a measured GFR slope in donors of around −0.4 mL/min/1.73 m²/year [2, 42]. Based on these findings, we set the cut-off for a relevant eGFR slope at −1 mL/min/1.73 m²/year in the third follow-up year. This resulted in an unbalanced dataset with 185 donors in the average eGFR slope cohort and 53 donors in the accelerated declining eGFR slope cohort.
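The binary labeling this cut-off implies can be sketched as follows; note that treating slopes strictly below −1 as "accelerated" is an assumption, since the text does not state how a slope exactly at the boundary was classified:

```python
import numpy as np

CUTOFF = -1.0  # mL/min/1.73 m²/year, the study's cut-off in follow-up year 3

def label_accelerated(slopes):
    """1 = accelerated declining eGFR slope (steeper than the cut-off), 0 = average.

    Strict inequality at the boundary is an assumption for illustration.
    """
    return (np.asarray(slopes, dtype=float) < CUTOFF).astype(int)

# Illustrative slopes, including the cohort median of -0.33
labels = label_accelerated([-0.33, -1.4, 0.2, -2.1])
print(labels.tolist())  # [0, 1, 0, 1]
```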

Neither the ESKD risk score nor the descriptive statistics of the other pre-donation donor features used for model training effectively discriminated the donor cohort with the accelerated declining eGFR slope. We therefore employed machine learning to identify this donor cohort. Three machine learning models (random forests, extreme gradient boosting, support vector machines) and logistic regression as the established reference model were used to predict an accelerated declining eGFR slope in our donor cohort. Overall, no model predicted the outcome sufficiently well: none exceeded an AUC of 0.7 or an F1 score of 0.5.
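A sketch of such a benchmark, assuming scikit-learn, synthetic data in place of the real donor features (matching the 185/53 imbalance), and `GradientBoostingClassifier` as a stand-in for extreme gradient boosting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the donor cohort: 238 donors, ~185/53 class imbalance
X, y = make_classification(n_samples=238, n_features=12, n_informative=4,
                           weights=[185 / 238], random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "support vector machine": SVC(probability=True, random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    f1 = cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
    scores[name] = (auc, f1)
    print(f"{name}: k-fold AUC={auc:.2f}, k-fold F1={f1:.2f}")
```

Stratified folds preserve the 78/22 class ratio in every split, which matters for stable AUC and F1 estimates on a dataset this small.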

Similarly, Jeon et al. [12] reported mediocre machine learning performance in predicting the percentage of renal adaptation (6–12 months post-donation eGFR relative to pre-donation eGFR; cut-off: 65% of pre-donation eGFR) of kidney donors after training on preoperatively assessed donor features. The authors reported an AUC of 0.63, similar to our results. They additionally trained the machine learning model to predict the absolute median eGFR of the second half of the first follow-up year (cut-off: 60 mL/min/1.73 m²). Here, clearly improved model performance was observed, with an AUC of 0.85. However, as discussed above, we consider predicting the decline of excretory kidney function superior to predicting GFR alone.

Despite the low predictive performance of the machine learning models, some trends were observed in model performance when training on different data subsets. The first data subset used for model training comprised the patient characteristics needed to calculate the ESKD risk score for kidney donors. This risk score was introduced in 2016 by Grams et al. [7] after following more than 4,000,000 individuals who were formally eligible for kidney donation for 4–16 years. In our transplant center, we use this risk score to screen potential donor candidates and to exclude donors at risk. Our interest was whether these well-established parameters suffice to predict an accelerated declining eGFR slope with machine learning.

The calculated 15-year and lifetime ESKD risk scores for our donor cohort were below 1% in both eGFR slope cohorts. Interestingly, a statistically significant difference was noted for the 15-year ESKD risk score; however, the differences in absolute values were marginal. The calculated risk scores themselves were not included in model training. Likewise, we did not consider non-insulin-dependent diabetes and race for model training because these features showed no variation in our patient cohort.

The best performance on the risk score dataset was achieved by support vector machines, which are known to be efficient on small datasets [43]. However, the differences in performance compared to the other models were marginal. In our study, machine learning models failed to adequately predict an accelerated declining eGFR slope after being trained on the patient characteristics previously evaluated for ESKD risk prediction.

Subsequently, we integrated more features into the dataset, expecting that the additional information would improve predictions. We included further donor details such as body weight, height, pack years, and renal cortex volumetry from CT scans (Dataset 2). For the entire dataset (Dataset 3), the results of the time-zero biopsy, including Banff Lesion Scores, were added. Even though the histology of living donor kidneys is not available during pre-donation screening, the results of the time-zero biopsy might be informative about the outcome of the donor's remaining kidney.

Including more parameters led to slightly better results for random forests and extreme gradient boosting but worsened the predictions of support vector machines. Logistic regression performed consistently across the different data subsets. Barah et al. [44] also reported a slight improvement in model performance when predicting kidney discard with machine learning after adding parameters from the graft biopsy. Nevertheless, expanding the dataset with predefined features did not improve the prediction of donors with an accelerated declining eGFR slope.

Finally, we applied machine learning-driven feature selection to the whole dataset. We used model-agnostic sequential feature selection in a forward approach, sequentially adding the most informative features to enhance model performance in k-fold cross-validation [20]. After sequential forward selection, each model exhibited a different subset of best predictive features: eight features for logistic regression and support vector machines, and six for extreme gradient boosting and random forests. We then retrained each model with its selected features. A clear improvement in prediction was observed for extreme gradient boosting and random forests. Both ensemble methods achieved a k-fold AUC of 0.66 and a k-fold F1 score of 0.44, outperforming logistic regression and support vector machines, which did not show improved predictive performance. These findings are consistent with previous machine learning studies in kidney transplantation: feature selection improved predictive performance [9], and random forests or extreme gradient boosting outperformed logistic regression [9, 10, 44].
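Sequential forward selection of this kind can be sketched with scikit-learn's `SequentialFeatureSelector`; the synthetic data, the random forest estimator, and the choice of AUC as the selection score are assumptions for illustration (the study cites its own tooling in [20]):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the full feature matrix (Dataset 3)
X, y = make_classification(n_samples=238, n_features=12, n_informative=4,
                           weights=[185 / 238], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
sfs = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=6,   # six features, as found for the ensemble models
    direction="forward",      # add the most informative feature at each step
    scoring="roc_auc",
    cv=cv,
)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())
print("selected feature indices:", selected.tolist())
```

The model would then be retrained on `X[:, selected]` only, discarding the redundant columns.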

The best predictive features appearing in all four models after sequential forward selection were those related to smoking, namely smoking history or pack years, and the Banff Lesion Score g (glomerulitis). Smoking is a cardiovascular risk factor widely known to increase the risk of developing chronic kidney disease [45]. It is therefore not surprising that all four models use smoking-related features to improve predictions of an accelerated declining eGFR slope.

The Banff classification is designed for allograft pathologies [15]. Nevertheless, pathologies found in the time-zero biopsy provide insights into the donor's remaining kidney. The Banff g lesion score grades the proportion of glomeruli showing microvascular inflammation, which may be linked to antibody-mediated graft rejection or to recurrent or de novo glomerulonephritis [15]. Previous studies reported that glomerulitis was associated with allograft pathologies or graft failure [46–50]. Conclusively determining whether the causes of glomerulitis are recipient-associated, for example humoral rejection or recurrence of an underlying condition, is hindered by inconsistent documentation of the timing of biopsy acquisition relative to reperfusion. Whether the presence of glomerulitis in the time-zero biopsy of the graft allows conclusions about the remaining kidney function of living kidney donors needs to be investigated in further studies.

From a data science perspective, we faced several hurdles that account for the moderate model performance. We trained our models on a small, unbalanced dataset containing missing values. There is a widespread belief that artificial intelligence can only recognize patterns in large amounts of data. However, small datasets are common in the medical field. Althnian et al. [51] empirically investigated the influence of dataset size on the performance of machine learning models using datasets from the medical domain. They found that it is not the dataset size itself that affects predictive ability, but rather how closely the data reflect the general distribution of the patient cohort. These findings are consistent with the results of our study: predictions improved not by including more data, but by identifying the predictive features and retraining the models without redundant features.

Our study has several limitations. We used eGFR values instead of measured GFR to calculate the eGFR slope. Our dataset contained missing values, mainly due to incomplete documentation of the histopathological parameters; since there is no gold standard in data science for the tolerable proportion of missing values, this remains a matter of empirical testing. We did not include all parameters that define the Banff classification because some showed no variation in our cohort. Our dataset stemmed from a single transplantation center. The performance of the machine learning models was evaluated by k-fold cross-validation, which probes the models' ability to generalize; to further test predictive performance and generalizability, an external test set is required for validation.
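One practical consequence of combining missing values with k-fold cross-validation is that imputation should be fitted inside each training fold to avoid leakage from the held-out fold. A sketch, assuming scikit-learn, median imputation, and synthetic data with roughly 10% missing entries (the study's actual imputation strategy is not stated here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with ~10% missing entries, mimicking incompletely
# documented histopathological parameters
X, y = make_classification(n_samples=238, n_features=12, n_informative=4,
                           weights=[185 / 238], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

# Placing the imputer inside the pipeline means each cross-validation
# training fold computes its own medians, so no information leaks from
# the held-out fold into the imputed values.
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestClassifier(n_estimators=100, random_state=0))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
print(f"k-fold AUC with median imputation: {auc:.2f}")
```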
