Second, our outcome was self-reported in the questionnaire and might therefore suffer from selection bias or reporting bias. Regarding selection bias, NHANES recruits a nationally representative sample through complex sampling with manual quality control; we also incorporated the sampling weights into the modeling to adjust the effect sizes for the whole population and to evaluate the selection bias. To assess reporting bias, we used UKB hospital inpatient data to correct the labels.
Third, a further concern is the proportion of deleted observations and features. We excluded: 1) observations with non-valid ('Refused to answer' or 'Don't know') responses in covariates; 2) unlabeled observations (missing values or non-valid responses in outcomes); 3) in NHANES 2015–16 for model development, features with over 30% missing values. We infer that the resulting biases were small. First, the non-valid answers were outside the scope of our analysis: we wanted to explore features whose definitions are clear to medical practitioners/caregivers, and excluding non-valid answers might even magnify the effect of valid answers. In addition, the total proportion of such answers was small, ranging from 14%–18% of the raw data in NHANES and 23% in UKB, so deleting non-valid answers in covariates appeared to introduce little bias. Second, for the unlabeled data, the sensitivity analysis suggested low bias, with only a 0.02 AUROC reduction despite a more-than-doubled sample size in NHANES 2015–16. Finally, for feature deletion, we deleted mainly biomarkers with over 30% missing values (including selenium, cadmium, and LDL-cholesterol); these tended to be collected in subgroups and therefore had small sample sizes. Even so, each dataset retained a large number of features (30 to 122) for model development, and some could serve as proxy variables for the deleted ones (e.g., LDL cholesterol can be approximately derived from total cholesterol and HDL cholesterol), thereby reducing the bias from lost information.
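As a minimal sketch of these exclusion rules, the pandas snippet below drops features with over 30% missing values and then drops observations carrying non-valid covariate codes. The column names and the 7/9 "Refused"/"Don't know" codes are illustrative only, not the actual NHANES variables.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for a merged NHANES extract; variables are made up.
df = pd.DataFrame({
    "age":     [45, 60, 52, 70, 38],
    "smoking": [1, 2, 7, 1, 9],  # hypothetical codes: 7 = Refused, 9 = Don't know
    "ldl":     [np.nan, np.nan, 3.1, np.nan, np.nan],  # mostly-missing biomarker
})

# 1) Drop features with over 30% missing values.
keep = df.columns[df.isna().mean() <= 0.30]
df = df[keep]

# 2) Drop observations with non-valid covariate responses.
df = df[~df["smoking"].isin([7, 9])]
```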
Fourth, and related to feature deletion, many important established blood biomarkers were not included in our modeling. NHANES lacked some important stroke-associated variables, such as lipoprotein-associated phospholipase A2, D-dimer, and interleukin 6, whose inclusion might further improve performance. However, based on the results for the blood biomarkers, we deduced that these missing laboratory biomarkers might still be outperformed by the SHAP set in our modeling. Moreover, we prioritized informative and easily collected features to improve the simplicity and utility of our nomogram model and to gain higher generalization capability.
Fifth, the NHANES data still lacks information on other features. For example, arthritis history had 5708 observations, while arthritis type had only 1433; thyroid history had 5806, while current thyroid status had only 593. More informative features such as arthritis type and current thyroid status thus had small sample sizes and were deleted because of too many missing values, although this also introduced bias into our results.
Despite these limitations, our models, built on a few questionnaire-based clinical variables, achieved high and robust performance and can potentially provide generalized and specialized characteristics of stroke survivors for a newer estimate of stroke prevalence.
Appendix A. Data and methodologies

Data
Clinical and second clinical set variables from demographic, examination, and questionnaire data
The lists for the clinical set and the second clinical set included: 1) influential factors: high blood pressure, smoking, diabetes, physical inactivity, obesity, high blood cholesterol, heart diseases, sickle cell disease, age, race, gender, income, alcohol, drug abuse, sleep habits, oral health, gout, asthma, angina, thyroid, cancer, and hepatitis; 2) symptoms: facial or limb weakness/numbness, slurred speech (confusion), trouble seeing, trouble walking, and severe headache; 3) complications: urinary tract infection and/or bladder control, pneumonia, swallowing problems, clinical depression, shoulder pain/anxiety, breathing problems, and aspirin use. Moreover, possible confounders, including diabetes risk, taking insulin, and medication for depression, anxiety, and cholesterol, were also added. The clinical set consisted of the influential factors, while the second clinical set was the combination of all the above variables.
Blood biomarkers from laboratory data
A list of blood biomarkers covering oxidative stress, metabolic, and inflammatory markers was compiled. It included: 1) metabolic: calcium, iron, cadmium, chloride, total cholesterol, triglycerides, percentage of segmented neutrophils, red cell distribution width, glycohemoglobin, potassium, sodium, high-density lipoprotein cholesterol (HDL C), folic acid, and glucose; 2) inflammatory: white blood cell count, hematocrit, platelet count, aspartate and alanine aminotransferase, gamma-glutamyl transferase, lactate dehydrogenase, creatinine, high-sensitivity C-reactive protein, monocyte/HDL-C ratio, and hemoglobin; 3) oxidative stress: segmented neutrophil/lymphocyte ratio, total bilirubin, and uric acid; 4) neurohormone: cotinine.
Dietary nutrients from dietary data
The American Heart Association Diet and Lifestyle Recommendations encourage eating a variety of nutritious foods from all the food groups and eating fewer nutrient-poor foods to fight cardiovascular disease. In our work, we included all the nutrients contained in the foods or beverages of the dietary intake data for stroke prediction; dietary supplements were also counted. These nutrients included: 1) nutrients that offer the most calories: carbohydrates, sugars, total fats, HDL C, protein, fiber, and saturated and unsaturated (monounsaturated and polyunsaturated) fatty acids; 2) vitamins: vitamin A/B1/B2/B6/B12/C/D/E/K, alpha-carotene, beta-carotene, lycopene, lutein, riboflavin, niacin, folic acid, B-cryptoxanthin, theobromine, and folate; 3) minerals: sodium, phosphorus, zinc, potassium, magnesium, iron, copper, and selenium; 4) other nutrients: water, alcohol, and caffeine.

Variable definitions and extraction
After the variables mentioned above were extracted from the different files of the NHANES 2015–16 database, they were merged on the unique ID 'SEQN' to generate the four datasets. The occurrence of stroke was determined by the subject's answer to the question 'Has a doctor or other health professional ever told that . . . had a stroke?'. Out of 9575 individuals, 5714 answered either 'Yes' (209) or 'No' (5505), 3856 had missing values, and five individuals refused to answer or answered 'Don't know'. Based on this stroke proportion, other features were merged, removed if over 30% of their values were missing, or replaced by similar proxy variables if available. For example, smoking was defined by the response to the question 'Smoked at least 100 cigarettes in life?' rather than 'Do you now smoke cigarettes?' because the latter produced over 30% missing values. Observations answering 'Don't know' or 'Refused to answer' for stroke were also deleted, although they were counted in the semi-supervised model. Systolic blood pressure (SBP) and diastolic blood pressure (DBP) in the examination data were obtained from three consecutive readings taken after the subject had rested for five minutes in a seated position and after determination of the maximum inflation level; a fourth reading was taken if a measurement was interrupted or incomplete. We then averaged the three readings to obtain the final SBP and DBP values. Urine biomarkers in the 'Laboratory Data' were excluded because of the small sample size and the few variables remaining after deleting variables with over 30% missing values, but the dietary supplements in the 'Dietary Data' were considered and added to the total nutrient intakes. Moreover, 'Added alpha-tocopherol (Vitamin E) (mg)' and 'Added vitamin B12 (mcg)' were added to 'Vitamin E as alpha-tocopherol (mg)' and 'Vitamin B12 (mcg)'. The final dietary values were averaged over the first- and second-day records.
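The merging and averaging steps can be sketched as follows; the file contents below are made up, and only the 'SEQN' join key and the three-reading blood-pressure average follow the text.

```python
import pandas as pd

# Toy stand-ins for two NHANES files (demographic and examination data).
demo = pd.DataFrame({"SEQN": [1, 2, 3], "age": [55, 62, 47]})
exam = pd.DataFrame({
    "SEQN": [1, 2, 3],
    "sbp1": [120, 138, 115],  # three consecutive SBP readings
    "sbp2": [122, 140, 117],
    "sbp3": [121, 139, 116],
})

# Merge on the unique ID 'SEQN', then average the three readings.
merged = demo.merge(exam, on="SEQN", how="inner")
merged["SBP"] = merged[["sbp1", "sbp2", "sbp3"]].mean(axis=1)
```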
Methodologies
DataWig imputation
DataWig imputation, which can be used for numerical, categorical, and unstructured text data [22], was adopted in this study. Inspired by established approaches [57], DataWig follows the process of multivariate imputation by chained equations (MICE) [58]. First, string and character-sequence features are given string and numeric representations and then further transformed into embeddings via hashing or a Long Short-Term Memory (LSTM) network [59] or n-grams [60]; numeric features are transformed into embeddings directly. All embeddings are then concatenated and finally fitted with a regression or cross-entropy loss according to the type of missing value. DataWig compares favorably with other implementations (mean [61], k-nearest neighbor (KNN) [39], matrix factorization [62], MissForest [57], MICE [58]) for numeric and unstructured text imputation, even in the complex missing-not-at-random condition [22].
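DataWig itself ships as a separate library; as a lightweight stand-in for the chained-equations (MICE-style) scheme it follows, scikit-learn's IterativeImputer illustrates the round-robin idea of modeling each incomplete feature on the others. This is an illustration of the principle, not the DataWig API.

```python
import numpy as np
# IterativeImputer is scikit-learn's chained-equations (MICE-style) imputer;
# the experimental-enable import is required by scikit-learn.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)  # column 2 predictable from column 0
X[::5, 2] = np.nan                              # knock out 20% of column 2

# Each feature with missing values is iteratively regressed on the others.
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
```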
Feature selection by BoostARoota

We used BoostARoota [] to filter out redundant features and select important ones. BoostARoota, a modified version of the Boruta algorithm [63], is a wrapper feature selection algorithm. Compared to Boruta, BoostARoota uses XGBoost [64] as the base model and modifies the feature elimination process, making it computationally faster than Boruta. We repeated BoostARoota thirty times (changing its random seed) and chose the overlapping features as the robust features for further analysis.
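The shadow-feature idea at the core of Boruta and BoostARoota can be sketched as follows. This single-round version uses a random forest in place of XGBoost and omits BoostARoota's iterative elimination, so it is an illustration of the principle rather than the actual algorithm: a feature survives only if it is more important than the best shuffled "shadow" copy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)      # only feature 0 is informative (toy setup)

# Shadow features: each column shuffled independently, destroying any signal.
shadows = rng.permuted(X, axis=0)
X_aug = np.hstack([X, shadows])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_aug, y)
imp = model.feature_importances_
threshold = imp[5:].max()          # importance of the best shadow feature
selected = [i for i in range(5) if imp[i] > threshold]
```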
Imbalance classification analysis

In general, there are two strategies for handling class-imbalanced classification: the data-level approach and the algorithm-level approach [65]. The data-level approach employs a preprocessing step to rebalance the class distribution. Sampling, as a preprocessing step, is a very effective method for addressing class imbalance [66] and has been shown to improve the predictive power of models on class-imbalanced datasets [67]. Our H2O model adopted a sampling method to adjust the skewed stroke distribution and improve training. In addition to sampling, feature selection is another preprocessing step gaining popularity in class-imbalanced classification tasks. Feature selection removes irrelevant, redundant, or noisy data that contribute to class overlapping under class imbalance [65,68]. We applied feature selection to reduce the number of features used in our nomogram.

The algorithm-level approach, in which the algorithms are tuned to improve the learning of the smaller classes, includes one-class learning and cost-sensitive learning [16]. Our IF model is a one-class classification algorithm aimed at outlier or anomaly detection [24]. It can be effective for imbalanced classification datasets in which stroke cases are both few in number and distinct in the feature space. Our DNN is a cost-sensitive neural network [25]: it is trained with the focal loss function [29], which assigns a larger error weight to stroke cases and reshapes the standard cross-entropy loss to improve class-imbalance learning during standard DNN training. Our LR adjusts observation weights inversely proportional to the stroke frequencies in the training data to improve training under class imbalance [27].

Thresholding is another cost-sensitive approach, applied in a postprocessing step, that aims to identify the optimal decision threshold for classification [69]. For binary classification, 0.5 is the typical threshold, but it may be biased toward the majority class in imbalanced data [69,70], so moving the decision threshold is an alternative technique for dealing with class imbalance [71]. The Youden index is a linear transformation of the mean sensitivity and specificity. It can define thresholds that avoid failures in evaluating an algorithm's ability, and it is applied in imbalanced cases because of its invariance to the imbalance ratio [72,73,74]. Therefore, we also used the Youden index to define the thresholds for classification.
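A minimal implementation of Youden-index threshold selection on toy scores, where J = sensitivity + specificity − 1 and the chosen cutoff maximizes J over the candidate thresholds:

```python
import numpy as np

# Toy labels and predicted probabilities (imbalanced: 3 positives, 7 negatives).
y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.8, 0.7, 0.9, 0.35])

def youden_threshold(y_true, y_score):
    """Return the threshold maximizing J = sensitivity + specificity - 1."""
    best_j, best_t = -1.0, 0.5
    for t in np.unique(y_score):
        pred = (y_score >= t).astype(int)
        tp = ((pred == 1) & (y_true == 1)).sum()
        tn = ((pred == 0) & (y_true == 0)).sum()
        fp = ((pred == 1) & (y_true == 0)).sum()
        fn = ((pred == 0) & (y_true == 1)).sum()
        j = tp / (tp + fn) + tn / (tn + fp) - 1  # sensitivity + specificity - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

t, j = youden_threshold(y_true, y_score)
```

On this toy data the classes separate perfectly at 0.7, so the selected cutoff sits well below the default 0.5 + margin the majority class would otherwise impose.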
SHapley Additive exPlanations

SHapley Additive exPlanations (SHAP), as a tool for model interpretation, is based on a game-theoretic approach and can explain the output of any ML model [75]. It can thus provide clinical value for interpreting the influential factors of stroke. SHAP was developed to solve the problem of inconsistency in many feature attribution methods: a feature may play an important role in the model while importance measures such as "Gain", "Split", and "Saabas" assign it a lower importance value; SHAP guarantees consistency in theory [76]. SHAP assigns each feature an importance value reflecting its effect on a particular classification. For a given feature, feature_i, the Shapley value [77] is the weighted average of all possible marginal contributions of feature_i, which is used as the feature attribution. Eq. (1) presents the classic Shapley value estimation [77] for feature_i, where F denotes the full feature set and Prediction(subset, x) is the model output using only the features in subset:

SHAP_{feature_i}(x) = \sum_{subset \subseteq F \setminus \{feature_i\}} \frac{|subset|!\,(|F|-|subset|-1)!}{|F|!} \left[ Prediction(subset \cup \{feature_i\}, x) - Prediction(subset, x) \right] \quad (1)
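The estimation in Eq. (1) can be checked by brute-force enumeration on a toy value function. The additive "prediction" below is invented purely for illustration; because it is additive, each Shapley value simply equals the feature's own contribution.

```python
from itertools import combinations
from math import factorial

features = [0, 1, 2]
contrib = {0: 2.0, 1: 1.0, 2: 0.0}  # made-up per-feature contributions

def prediction(subset):
    # Toy additive value function standing in for Prediction(subset, x).
    return sum(contrib[f] for f in subset)

def shap_value(i):
    """Exact Shapley value of feature i by enumerating all subsets (Eq. 1)."""
    others = [f for f in features if f != i]
    n = len(features)
    total = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            weight = factorial(len(subset)) * factorial(n - len(subset) - 1) / factorial(n)
            total += weight * (prediction(subset + (i,)) - prediction(subset))
    return total
```

Because the subset weights sum to one for each feature, an additive model recovers each feature's contribution exactly; real SHAP implementations approximate this sum, since exact enumeration is exponential in the number of features.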