ColoViT: a synergistic integration of EfficientNet and vision transformers for advanced colon cancer detection

Experiment environment

The experimental platform used in this paper comprises an Intel Xeon(R) E5-2780 CPU with a 2.80 GHz core frequency and an NVIDIA GeForce GTX 1080 GPU. The proposed model is implemented in Python 3.7 using the PyTorch framework (Paszke et al. 2019), combining ample computational power with a mature software stack for deep learning tasks.

Evaluation metrics

The classification model was evaluated using four criteria essential for medical diagnosis: accuracy, precision, recall, and F1-score. These measures are vital for clinical decision making, since a clinically useful model must not only achieve high overall accuracy but also minimize false negatives and false positives.

$$\begin{aligned} \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$

(6)

$$\begin{aligned} \text{Precision} = \frac{TP}{TP + FP} \end{aligned}$$

(7)

$$\begin{aligned} \text{Recall} = \frac{TP}{TP + FN} \end{aligned}$$

(8)

$$\begin{aligned} \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned}$$

(9)

Where TP represents true positive samples, TN true negative samples, FP false positive samples, and FN false negative samples.
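For illustration, Eqs. (6)–(9) can be computed directly from the four confusion counts. The following minimal Python sketch (the function name and example counts are ours, not taken from the paper) shows the calculation:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from raw confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (6)
    precision = tp / (tp + fp)                   # Eq. (7)
    recall = tp / (tp + fn)                      # Eq. (8)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (9)
    return accuracy, precision, recall, f1

# Hypothetical counts for a single binary decision (e.g. tumor vs. normal).
acc, prec, rec, f1 = classification_metrics(tp=90, tn=85, fp=5, fn=10)
```

Note that precision and recall are undefined when their denominators are zero; in practice such cases are usually reported as 0 or skipped.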

In addition to the core evaluation metrics (Accuracy, Precision, Recall, and F1-Score), we also compute other widely used statistical metrics to further assess model reliability, particularly in medical diagnostics.

Specificity (True Negative Rate) is given by:

$$\begin{aligned} \text{Specificity} = \frac{TN}{TN + FP} \end{aligned}$$

(10)

Matthews Correlation Coefficient (MCC) provides a balanced measure even if classes are imbalanced:

$$\begin{aligned} \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \end{aligned}$$

(11)

Cohen’s Kappa Score and Youden’s Index are also computed to assess model agreement with ground truth and diagnostic strength, respectively.

Cohen’s Kappa Score is calculated as:

$$\begin{aligned} \kappa = \frac{p_o - p_e}{1 - p_e} \end{aligned}$$

(12)

where \(p_o\) is the observed agreement (accuracy), and \(p_e\) is the expected agreement by chance.

Youden’s Index is defined as:

$$\begin{aligned} J = \text{Sensitivity} + \text{Specificity} - 1 \end{aligned}$$

(13)
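The additional metrics in Eqs. (10)–(13) can likewise be derived from the confusion counts. The sketch below (our own illustration; for binary one-vs-rest counts, with κ's chance agreement computed from the marginal totals) implements them:

```python
import math

def extended_metrics(tp, tn, fp, fn):
    """Specificity, MCC, Cohen's kappa, and Youden's index from counts."""
    n = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)                          # Eq. (10)
    mcc_den = math.sqrt((tp+fp) * (tp+fn) * (tn+fp) * (tn+fn))
    mcc = (tp*tn - fp*fn) / mcc_den                       # Eq. (11)
    p_o = (tp + tn) / n                                   # observed agreement
    # Expected agreement by chance, from predicted and actual marginals.
    p_e = ((tp+fp) * (tp+fn) + (tn+fn) * (tn+fp)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)                       # Eq. (12)
    youden = sensitivity + specificity - 1                # Eq. (13)
    return specificity, mcc, kappa, youden
```

In practice, library implementations (e.g. scikit-learn's `matthews_corrcoef` and `cohen_kappa_score`) would typically be used instead of hand-rolled formulas.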

Evaluation of proposed system

The dataset is divided into training, validation, and test sets in an 80:10:10 ratio. During training, a batch size of 32 and a learning rate of 0.0001 were employed to balance comprehensive learning against computational efficiency. Both the EfficientNet model and the ViT model were trained for 50 epochs. This number of epochs was determined from preliminary experiments indicating an optimal balance between sufficient model convergence and prevention of overfitting, given the complexity of the models and the dataset size.
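The 80:10:10 split described above can be sketched as follows. This is a minimal illustration under our own assumptions (the paper does not specify the shuffling procedure or random seed):

```python
import random

def split_dataset(indices, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle once, then carve out train/val/test in an 80:10:10 ratio."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)   # seeded for reproducibility (our choice)
    n = len(idx)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_dataset(range(1000))
```

The resulting index lists would then feed the respective data loaders, with the stated batch size of 32 and learning rate of 0.0001 applied during training.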

After training, the model parameters were evaluated on the test dataset. The ColoViT model, an ensemble, successfully integrates the EfficientNet and ViT models, exploiting their respective strengths. The performance indicators recorded during training and validation offer valuable insight into the learning process. For instance, in epoch 1 the training accuracy of EfficientNet V2 B0 was 75.1% and its validation accuracy was 73.36%; ViT-B16 showed slightly lower accuracies, while ColoViT achieved the highest. As training advanced to 50 epochs, accuracy improved and loss decreased across all models, with ColoViT consistently exhibiting superior performance.

By following these steps, the proposed ColoViT Ensemble model is used for colorectal cancer classification. This Ensemble model takes advantage of the strengths of both models to enhance classification performance and accuracy. Table 4 shows the performance metrics for the Ensemble Model, ViT-B16, and EfficientNet V2 B0 models during training and validation.

Table 4 Model Performance Summary

Figure 8 presents the loss and accuracy curves for the Vision Transformer and EfficientNet models, and for the ColoViT ensemble, during training and validation. The left graph shows training and validation loss; both decrease across the epochs, indicating steady improvement, and the ColoViT ensemble reaches the lowest validation loss, suggesting superior generalization. The right graph shows training and validation accuracy, where the ColoViT ensemble again achieves the best accuracy, indicating superior predictive capability. Together, the graphs highlight the stronger learning capacity and effectiveness of the ensemble compared with the individually trained models.

Fig. 8

Training and Validation Loss and Accuracy for Models

Table 5 provides a detailed evaluation of the inference times of the three models: EfficientNet V2 B0, ViT-B16, and the ensemble model ColoViT. EfficientNet V2 B0 shows a rapid average inference time of 0.0059 s per sample with a standard deviation of 0.0033 s, highlighting its efficiency and consistency. ViT-B16 has a slightly longer average inference time of 0.0089 s per sample, but with a very low standard deviation of 0.0003 s, indicating highly predictable performance. The ensemble model ColoViT, which combines both models to improve prediction accuracy, has the longest average inference time at 0.0147 s per sample; this overhead is typical of ensemble methods, which require multiple model evaluations. These figures illustrate the trade-off between inference-time efficiency and computational demand, which is crucial when deploying models in real-time applications.
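The per-sample latency figures in Table 5 can be reproduced with a simple timing harness. The sketch below is our own illustration (the paper does not describe its timing procedure; the warm-up count is an assumption):

```python
import time
import statistics

def benchmark(fn, inputs, warmup=3):
    """Mean and standard deviation of per-sample latency, as in Table 5."""
    for x in inputs[:warmup]:          # warm-up runs, excluded from timing
        fn(x)
    times = []
    for x in inputs:
        t0 = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)

# Usage: benchmark(model_forward, test_samples) for each model under test.
```

For GPU models, a synchronization call (e.g. `torch.cuda.synchronize()`) would be needed before each timestamp to obtain accurate timings.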

Although the ensemble model ColoViT incurs the highest inference time, it outperforms the individual models in diagnostic performance. EfficientNet V2 B0 offers a balance of speed and accuracy, while ViT-B16 excels at capturing global dependencies, albeit at a slightly higher computational cost. By strategically combining both architectures, the ensemble achieves a favourable balance of accuracy and inference time, making it suitable for real-time clinical applications. Despite the increased computational time of 0.0147 s per sample, the ensemble model's enhanced diagnostic performance makes it a reliable and efficient approach for diagnosing colorectal cancer.

Table 5 Inference Time Efficiency Comparison

Although the ColoViT ensemble model improves classification accuracy by leveraging both EfficientNet and Vision Transformer architectures, it also introduces increased computational complexity and inference time. As shown in Table 5, the ensemble model requires an average of 0.0147 s per image, which is approximately 2.5 times slower than EfficientNet V2 B0 alone. This overhead is typical for ensemble approaches, which involve multiple model evaluations.

While the inference latency is acceptable for semi-real-time diagnostic scenarios in clinical settings equipped with high-performance GPUs, it may pose limitations for deployment in resource-constrained or mobile environments. To address this, future work can explore model compression, pruning, quantization, or knowledge distillation techniques to reduce the model’s size and accelerate inference while maintaining diagnostic accuracy.

Confusion matrix analysis

This section summarizes the confusion matrices for EfficientNet V2 B0, Vision Transformer (ViT-B16), and the proposed ColoViT model across the training, validation, and test datasets. The six classes considered are: Normal tissue (NORM), Hyperplastic Polyp (HP), Tubular Adenoma High-Grade (TA.HG), Tubular Adenoma Low-Grade (TA.LG), Tubulo-Villous Adenoma High-Grade (TVA.HG), and Tubulo-Villous Adenoma Low-Grade (TVA.LG).
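As a concrete illustration, a six-class confusion matrix of the kind summarized in this section can be accumulated as follows (the function is our own sketch; the class labels are those listed above):

```python
def confusion_matrix(y_true, y_pred, classes):
    """Integer confusion matrix: rows = true class, columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

# The six histological classes considered in the paper.
CLASSES = ["NORM", "HP", "TA.HG", "TA.LG", "TVA.HG", "TVA.LG"]
```

Each off-diagonal cell m[i][j] counts samples of true class i predicted as class j, which is exactly the quantity inspected in the misclassification analysis below.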

Training dataset

As shown in Fig. 9, all three models achieve high classification accuracy on the training data. ColoViT demonstrates superior performance with the fewest misclassifications, especially in distinguishing between closely related classes such as TA.HG and TA.LG.

Fig. 9

Confusion Matrix for Training Dataset

Validation dataset

On the validation set (Fig. 10), performance trends remain consistent. ViT-B16 shows balanced predictions, while ColoViT continues to lead in classification accuracy and demonstrates better differentiation between visually similar categories.

Fig. 10

Confusion Matrix for Validation Dataset

Test dataset

The test set results (Fig. 11) reinforce the overall trend. ColoViT achieves the highest number of correct classifications and minimal misclassifications, confirming its robustness in real-world scenarios. EfficientNet and ViT also perform well but show more confusion between TA.HG and TA.LG.

Fig. 11

Confusion Matrix for Testing Dataset

Overall, the analysis confirms that the ensemble-based ColoViT model consistently outperforms the individual EfficientNet V2 B0 and ViT-B16 models, offering improved classification reliability for colorectal histopathological images.

Upon closer examination of the misclassifications, most errors occurred between classes that exhibit subtle visual differences in histological features. For instance, TA.HG samples are occasionally misclassified as TA.LG, which can be attributed to the presence of overlapping glandular structures and ambiguous dysplastic features between high-grade and low-grade adenomas. Similarly, HP samples were sometimes predicted as NORM, likely due to their benign appearance and structural resemblance under certain staining conditions. These misclassifications are also influenced by class imbalance and inter-class similarity in color and texture patterns, which may challenge the discriminative capacity of even advanced models. Such observations underline the need for enhanced feature representations or potential incorporation of domain knowledge to further improve differentiation between visually similar classes in future work.

Classification report

To offer a comprehensive analysis of performance, Table 6 provides a detailed breakdown of precision, recall, F1-score, and support by class for the three models: EfficientNet V2 B0, ViT-B16, and the ColoViT ensemble. The NORM class, which has the highest support at 340 instances, achieves outstanding precision and recall of 99.7% and 99.8% respectively under the ColoViT model, indicating its remarkable capability to identify true positives precisely and to separate NORM reliably from the other classes. For Hyperplastic Polyp (HP), all models post high metrics, but the ColoViT model exceeds 99% in both precision and recall, a noteworthy reduction in both false positives and false negatives for this crucial category. The results for Tubular Adenoma High-Grade dysplasia (TA.HG) and Tubulo-Villous Adenoma High-Grade dysplasia (TVA.HG) are especially notable: the ColoViT model shows somewhat superior recall for TA.HG and precision for TVA.HG, suggesting that it excels at accurately diagnosing these less common lesion types.

The ColoViT model also demonstrates higher precision and recall for Tubular Adenoma Low-Grade dysplasia (TA.LG) and Tubulo-Villous Adenoma Low-Grade dysplasia (TVA.LG), most notably for TVA.LG, where precision reaches 99.6% and recall 99.5%. The model shows enhanced skill in accurately discerning these often challenging conditions, achieving precision and recall of 99.3% and thereby improving their diagnosis. The 'Micro Avg' row reports the mean performance across all classes, providing an overall evaluation of each model; the ColoViT model attains a micro-average accuracy, precision, recall, and F1-score of 99.4%, illustrating its consistency across all classes.
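The per-class figures of the kind tabulated in Table 6 follow directly from a confusion matrix. The sketch below (our own illustration, not the paper's code) derives precision, recall, F1-score, and support for each class:

```python
def per_class_report(cm, classes):
    """Precision, recall, F1, support per class from a confusion matrix
    (rows = true class, columns = predicted class)."""
    report = {}
    n = len(classes)
    for i, c in enumerate(classes):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp   # column sum minus diagonal
        fn = sum(cm[i]) - tp                        # row sum minus diagonal
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[c] = (prec, rec, f1, sum(cm[i]))     # support = row total
    return report
```

The micro-average follows by pooling TP, FP, and FN over all classes before applying the same formulas, which is why, for single-label classification, micro-averaged precision, recall, and F1 all coincide with accuracy.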

Table 6 Class-Wise Performance Metrics of the Models

The high precision and recall across all classes demonstrate the reliability and resilience of the ColoViT model, essential qualities for potential clinical application. Accuracy alone does not provide a complete picture of classification behaviour; the confusion matrix offers a detailed breakdown of predictions for each category. With an accuracy of 99.4%, the ColoViT model outperforms the other models, underscoring its capacity to classify colorectal polyps and adenomas accurately. The detail provided by the confusion-matrix analysis is crucial for understanding each model's ability to categorize specific classes, thereby guiding clinical approaches to colorectal polyp classification. This level of performance represents a notable advance over current models and has the potential to redefine accuracy benchmarks for colorectal polyp and adenoma classification.

To further strengthen the evaluation of the proposed ColoViT model, we computed additional diagnostic metrics that are essential in medical image analysis, such as Specificity, Matthews Correlation Coefficient (MCC), Cohen’s Kappa Score, and Youden’s Index. These metrics were computed per class and are presented in Table 7, highlighting the robustness and reliability of the proposed ensemble approach across all polyp categories.

Table 7 Class-wise Additional Diagnostic Metrics for the Ensemble Model (ColoViT)

To verify the statistical robustness of the observed performance improvements, a paired t-test was conducted between the ColoViT model and each of the baseline models (EfficientNet and ViT) using the F1-score across five cross-validation folds. The resulting p-values were less than 0.01 in both comparisons, indicating that the improvements achieved by the ensemble model are statistically significant.

Additionally, 95% confidence intervals (CI) for classification accuracy were computed across the cross-validation folds. As shown in Table 8, the ensemble model (ColoViT) achieved an average accuracy of 98.9% with a 95% CI of [98.7%, 99.1%], whereas EfficientNet and ViT achieved CIs of [95.4%, 96.7%] and [96.8%, 97.8%], respectively. The p-values reflect the probability that the performance differences occurred by chance, while the confidence intervals indicate the degree of consistency in model performance across folds. These results confirm the statistical significance and reliability of the performance improvements offered by the ColoViT framework.
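The statistical procedure described above can be sketched in a few lines. This is an illustrative implementation under our own assumptions (the per-fold scores are hypothetical; 2.776 is the two-sided t critical value for df = 4, matching 5-fold cross-validation):

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic for paired samples, e.g. per-fold F1 of two models."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    return mean_d / (sd_d / math.sqrt(len(diffs)))

def confidence_interval(scores, t_crit=2.776):
    """95% CI for the mean score over 5 folds (t_crit assumes df = 4)."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return m - t_crit * se, m + t_crit * se
```

In practice, `scipy.stats.ttest_rel` would also return the exact p-value; the sketch above only computes the test statistic, which is compared against the critical value.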

Table 8 Statistical Evaluation of Model Performance Using 5-Fold Cross-Validation

ROC curve analysis

In addition to the confusion matrix analysis, ROC-AUC analysis is a pivotal component of the evaluation. It offers a comprehensive measure of each model's performance across threshold settings, further confirming diagnostic reliability and clinical applicability. The ROC curves reported in this study (Fig. 12) evaluate the three models: EfficientNet V2 B0, ViT-B16, and ColoViT. The curves plot the true positive rate (TPR) against the false positive rate (FPR), allowing the classifiers' diagnostic capabilities to be examined at different threshold levels.

Fig. 12

ROC Curves for EfficientNet V2 B0, ViT B16, and ColoViT Models

The EfficientNet V2 B0 model demonstrates strong discriminative capability, with AUC values approaching 1 across all categories, indicating remarkable proficiency in distinguishing positive from negative classes. The ColoViT model performs notably well, with some categories achieving an AUC of 1, suggesting near-perfect classification for those classes. The ViT-B16 curves likewise exhibit high AUC values, confirming strong classification precision. The consistent position of these curves near the upper-left corner of the plot signifies accurate prediction with few false positives or false negatives, highlighting the robustness of the models. These high AUC values also indicate suitability for clinical contexts such as diagnostic imaging or patient risk assessment, where high sensitivity and specificity are of utmost importance.

The ROC curve is an essential tool for evaluating the trade-off between sensitivity (or TPR) and specificity (1 - FPR) across different thresholds without requiring an arbitrary classification threshold. This makes the ROC curve particularly valuable in medical diagnostic tests where the cost of false negatives varies significantly with the clinical context. The high AUC values reinforce the potential of these models to act as reliable decision-support tools in medical diagnostics, potentially reducing the cognitive load on healthcare professionals and increasing diagnostic accuracy.

This finding underscores the capacity of these classifiers to serve as decision-support instruments in the field of medical diagnostics, enhancing the proficiency of healthcare professionals. In a comparative analysis, these models demonstrate comparable or superior performance to existing benchmarks in the realm of automated diagnosis, signifying a notable progression in the field of artificial intelligence within the healthcare domain.

The ROC-AUC analysis in this study was performed using a one-vs-rest (OvR) strategy, where each class was individually treated as the positive class while all others were grouped as negative. This approach enabled the generation of distinct ROC curves for each of the six classes, providing class-specific insights into discriminative performance. The overall macro-average AUC was calculated by averaging the AUCs across all classes.
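The OvR macro-averaging strategy can be illustrated with a rank-based AUC, using the equivalence between AUC and the probability that a random positive outranks a random negative. The sketch below is our own (function names and example scores are illustrative):

```python
def binary_auc(y_true, scores):
    """AUC = P(random positive score > random negative score), ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(y_true, score_matrix, n_classes):
    """One-vs-rest: class c is positive, all others negative; average AUCs."""
    aucs = []
    for c in range(n_classes):
        binarized = [1 if y == c else 0 for y in y_true]
        aucs.append(binary_auc(binarized, [row[c] for row in score_matrix]))
    return sum(aucs) / n_classes
```

scikit-learn's `roc_auc_score(..., multi_class="ovr", average="macro")` implements the same scheme; a micro-averaged variant, as suggested above, would instead pool all class-wise decisions before computing a single AUC.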

Although OvR is widely adopted for multi-class settings, it does not account for inter-class relationships and may be influenced by class imbalance, particularly in cases with underrepresented categories. As a result, the near-perfect AUC values observed for some classes may reflect dataset-specific characteristics rather than true generalization. To address this, ROC-AUC findings should be considered complementary to the class-wise evaluation metrics already discussed. Incorporating micro-averaged AUCs or hierarchical performance assessments could further strengthen future multi-class evaluations.

To assess the generalizability of the ColoViT model and rule out overfitting, additional evaluation was performed using the publicly available NCT-CRC-HE-100K dataset (Kather et al. 2019), which includes 100,000 H&E-stained colorectal histology image patches categorized as normal or tumor. This dataset was acquired independently from a different institution and exhibits variations in staining protocols and imaging characteristics. The pretrained ColoViT model was directly tested on a balanced subset of this dataset without retraining. The model achieved an average accuracy of 97.1%, F1-score of 96.8%, and AUC of 98.4%, demonstrating strong generalization capability and robustness across datasets. This external validation supports the clinical relevance and deployment potential of the proposed method.

The proposed ColoViT model, developed and evaluated using the UniToPatho dataset, demonstrates strong potential for integration into clinical workflows as a decision-support system in digital pathology. UniToPatho offers high-resolution, expert-annotated H&E-stained colorectal tissue samples across multiple diagnostic categories, making it highly suitable for training and performance benchmarking. To assess generalizability beyond this dataset, the model was further validated on the independent NCT-CRC-HE-100K dataset, which contains histology images from a different institution with distinct staining and acquisition protocols. The model maintained high accuracy and AUC on this external dataset, supporting its robustness across imaging conditions.

In a clinical setting, the model could be integrated into digital pathology platforms or hospital PACS systems, where it could assist pathologists by automatically screening histopathology slides, flagging high-risk cases, and prioritizing reviews. This could help reduce diagnostic delays and inter-observer variability, particularly in resource-limited settings. Ultimately, the application of such AI-powered tools has the potential to improve diagnostic consistency, enable earlier intervention, and contribute to better patient outcomes in colorectal cancer care.
