A total of 83,430 cases diagnosed between 2010 and 2017 were drawn from the SEER database to form the training and testing sets, and 18,545 cases from 2018 to 2019 served as the external validation set (Table 1). Among these 83,430 patients, 43,889 (52.6%) were male. The racial distribution consisted of 65,641 (78.7%) White individuals, 8,776 (10.5%) Black individuals, and 9,013 (10.8%) individuals of other races. None of the baseline variables differed significantly between the training and testing sets (Table 1).
Table 1 Characteristics of all patients
Differences in characteristics between patients with and without LNM
Table 2 displays the characteristics of patients with and without lymph node metastasis (LNM). Patients with and without LNM differed significantly in age, tumor location, T stage, tumor size, tumor grade, CEA levels, and race (all P < 0.001). The LNM rate was higher in patients younger than 60 years than in the other age groups (P < 0.001). CEA-positive patients had a higher LNM rate (P < 0.001). The proportion of patients with LNM increased with tumor grade (P < 0.001). The LNM rate was significantly higher in rectal cancer than in colon cancer, and it also increased with tumor diameter.
Table 2 Difference analysis of patients with or without LNM in the training set
Risk factors associated with LNM in CRC patients
The results of the univariate and multivariate logistic regression analyses are presented in Table 3. Multivariate logistic regression analysis revealed that T stage, CEA levels, tumor size, and tumor grade were independent risk factors for LNM in CRC patients. The risk of LNM in rectal cancer patients was 1.40 times that in colon cancer patients (OR 1.40; 95% CI 1.34–1.46). Compared to patients with tumor diameter < 7 cm, the risk of LNM in patients with tumor diameters of 7–15 cm and > 15 cm was 1.20 times (OR 1.20; 95% CI 1.01–1.43) and 1.36 times (OR 1.36; 95% CI 1.16–1.61), respectively. Compared to patients with grade I tumors, the risk of LNM in patients with grade II, III, and IV tumors was 1.31 times (OR 1.31; 95% CI 1.21–1.41), 2.32 times (OR 2.32; 95% CI 2.13–2.52), and 2.39 times (OR 2.39; 95% CI 2.11–2.72), respectively. Compared to T1 stage, the risk of LNM in patients with T2, T3, and T4 stages was 1.18 times (OR 1.18; 95% CI 1.68–2.03), 5.40 times (OR 5.40; 95% CI 4.96–5.89), and 8.89 times (OR 8.89; 95% CI 8.08–9.80), respectively. White patients had a lower risk of LNM than Black patients and patients of other races. The risk of LNM in CEA-positive patients was 1.28 times that in CEA-negative patients (OR 1.28; 95% CI 1.24–1.33). Compared to patients younger than 60 years, the risk of LNM in patients aged 60–70 years and > 70 years was 0.76 times (OR 0.76; 95% CI 0.73–0.80) and 0.55 times (OR 0.55; 95% CI 0.53–0.58), respectively.
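As a reading aid for the odds ratios above, the following minimal sketch shows how an OR and its 95% CI relate to a logistic regression coefficient and its standard error. It assumes Wald-type intervals, which the text does not state, and the function names are hypothetical. Under that assumption the interval is symmetric on the log scale, so the point estimate sits near the geometric mean of the two bounds.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def odds_ratio_ci(beta, se, z=Z_95):
    """Return (OR, lower, upper) for a logistic coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coef_from_or_ci(lower, upper, z=Z_95):
    """Recover the (beta, se) implied by a reported interval, assuming a Wald CI."""
    beta = (math.log(lower) + math.log(upper)) / 2.0
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return beta, se


# Rectal vs. colon cancer as reported in the text: OR 1.40 (95% CI 1.34-1.46).
beta, se = coef_from_or_ci(1.34, 1.46)
or_hat, lower, upper = odds_ratio_ci(beta, se)
print(f"beta={beta:.3f} se={se:.3f} OR={or_hat:.2f} CI=({lower:.2f}, {upper:.2f})")
```

The coefficient and standard error recovered here are implied values for illustration only; the fitted coefficients themselves are those in Table 3.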
Table 3 Univariate and multivariate logistic regression analyses of factors associated with LNM
Model comparison and selection
We constructed models using logistic regression, LASSO regression, ridge regression, and elastic net regression. Table 4 presents the AUC values of these models in the training and testing sets. In the testing set, the logistic regression, LASSO regression, ridge regression, and elastic net regression models achieved AUCs of 0.708 (95% CI 0.704–0.712), 0.707 (95% CI 0.703–0.711), 0.708 (95% CI 0.704–0.712), and 0.708 (95% CI 0.702–0.714), respectively. There were no significant differences in AUC among these models (P > 0.05). Because all models performed similarly and the logistic regression model is the most clinically interpretable, it was selected and presented as a nomogram (Fig. 1).
Table 4 The area under the receiver operating characteristic curve (AUC) for different models
Fig. 1 Nomogram for predicting lymph node metastasis (LNM) in colorectal cancer (CRC) patients
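For readers less familiar with the AUC used to compare the models above, the sketch below computes it directly from its definition: the probability that a randomly chosen patient with LNM receives a higher predicted score than a randomly chosen patient without LNM, with ties counted as one half. The function name and the toy data are hypothetical and unrelated to the SEER cohort.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = LNM, 0 = no LNM; scores are made-up predicted probabilities.
toy_labels = [1, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.6, 0.7, 0.8, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")
```

The pairwise loop is quadratic in the number of cases, so it is only suitable for illustration; on a cohort of this size a rank-based computation would be the practical choice.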
Nomogram for predicting LNM in CRC patients
Table 5 displays the performance of the logistic regression model. In the testing set, the logistic regression model achieved an AUC of 0.708 (95% CI 0.704–0.712), accuracy of 0.637 (95% CI 0.63–0.641), sensitivity of 0.736 (95% CI 0.730–0.742), specificity of 0.569 (95% CI 0.564–0.574), PPV of 0.539 (95% CI 0.534–0.545), and NPV of 0.759 (95% CI 0.753–0.764). The Hosmer–Lemeshow goodness-of-fit test indicated good calibration of the predictive model (χ2 = 10.207, P = 0.251). In external validation, the model achieved an AUC of 0.709 (95% CI 0.701–0.716), indicating that its discrimination was maintained on external data (Fig. 2, Table 5).
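The testing-set metrics above are linked through Bayes' rule: given sensitivity, specificity, and the proportion of patients with LNM, the accuracy, PPV, and NPV follow. The sketch below illustrates this relationship. The LNM proportion of 0.406 is an assumption back-calculated from the reported PPV for illustration; it is not a figure stated in this section, and the function name is hypothetical.

```python
def metrics_from_rates(sensitivity, specificity, prevalence):
    """Derive accuracy, PPV, and NPV from sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence              # true positives, as a share of all cases
    fn = (1.0 - sensitivity) * prevalence      # false negatives
    tn = specificity * (1.0 - prevalence)      # true negatives
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false positives
    return {
        "accuracy": tp + tn,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Sensitivity and specificity are the reported testing-set values;
# the prevalence of 0.406 is an illustrative assumption (see lead-in).
derived = metrics_from_rates(sensitivity=0.736, specificity=0.569, prevalence=0.406)
for name, value in derived.items():
    print(f"{name}: {value:.3f}")
```

Under this assumed proportion, the derived values agree with the reported accuracy, PPV, and NPV to within rounding.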
Table 5 The performance of the logistic regression prediction model
Fig. 2 Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) for the logistic regression prediction model in the training set, test set, and external validation. A ROC curves in the training set; B ROC curves in the test set; C ROC curves in the external validation
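The Hosmer–Lemeshow P value reported above can be related to its χ2 statistic as follows. This sketch assumes the conventional grouping into 10 risk deciles, giving 8 degrees of freedom; the number of groups is not stated in the text, so this is an assumption. The helper name is hypothetical, and it uses a closed form of the chi-square upper tail that holds only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^i / i! for i = 0 .. df/2 - 1, built up term by term.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Reported statistic; df = 8 assumes 10 groups (groups minus 2), which is not stated.
p_value = chi2_sf_even_df(10.207, 8)
print(f"P = {p_value:.3f}")
```

A P value above 0.05 here means the test finds no significant departure of predicted from observed LNM frequencies, which is how the text interprets it.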
Further validation based on different subgroups
Further validation was conducted by gender, age, and tumor grade (Table 6). In the testing set, the logistic regression model performed well for male patients, female patients, patients aged < 60 years, 60–70 years, and > 70 years, and patients with grade I tumors, with AUC values of 0.705 (95% CI 0.696–0.714), 0.711 (95% CI 0.702–0.720), 0.685 (95% CI 0.673–0.696), 0.708 (95% CI 0.696–0.720), 0.704 (95% CI 0.694–0.714), and 0.737 (95% CI 0.714–0.760), respectively. In the external validation dataset, the model performed comparably in the same subgroups, with AUC values of 0.710 (95% CI 0.700–0.720), 0.707 (95% CI 0.696–0.718), 0.696 (95% CI 0.683–0.710), 0.710 (95% CI 0.696–0.724), 0.705 (95% CI 0.694–0.717), and 0.746 (95% CI 0.720–0.771), respectively.
Table 6 The performance of the prediction model based on different populations
Model fit analysis
The calibration curves of the nomogram (Fig. 3A–C) demonstrated high consistency between the predicted and observed probabilities of LNM in the training set, the testing set, and the external validation set. Additionally, the decision curve analysis (DCA) curves (Fig. 3D–F) indicated good clinical utility of the model.
Fig. 3 Calibration plots and decision curve analysis (DCA). A–C Calibration plots showing the consistency between the predicted probabilities and the actual values in the training set, the test set, and the external validation. D–F DCA curves assessing clinical utility in the training set, the test set, and the external validation
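The quantity plotted in a decision curve is the net benefit at each threshold probability: the true-positive rate per patient minus the false-positive rate per patient, weighted by the odds of the threshold. The sketch below shows this standard formula alongside the "treat all" reference strategy. The function names and the toy data are hypothetical and do not come from the study.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of acting on predictions at or above a threshold probability."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of the reference strategy that treats every patient as positive."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy example: 1 = LNM, 0 = no LNM; probabilities are made up for illustration.
toy_labels = [1, 0, 1, 0, 1, 0, 0, 1]
toy_probs = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.3, 0.8]
nb_model = net_benefit(toy_labels, toy_probs, threshold=0.5)
nb_all = net_benefit_treat_all(toy_labels, threshold=0.5)
print(f"model: {nb_model:.3f}, treat all: {nb_all:.3f}")
```

A model shows clinical utility over a range of thresholds when its net benefit exceeds both the "treat all" curve and the "treat none" line at zero.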