Deep learning-based prediction of cervical canal stenosis from mid-sagittal T2-weighted MRI

Despite the clinical significance of DCM and the increasing popularity of DL in radiology, the literature lacks large cohort studies and extensive comparisons between DL architectures for predicting CCS. In our study, we established a large DCM cohort comprising 8676 patients to train a variety of DL models, including ResNet50, VGG16, MobileNetV3, and EfficientNetV2. The ensemble model achieved the best AUC of 0.96. Grad-CAM analyses showed that the trained models consistently detect relevant features with high sensitivity and agree with each other.

The patient cohort used in our study is considerably larger than those used in preliminary research on DL models for predicting DCM, most of which include hardly more than 1000 patients [10,11,12,13]. Because cohorts of this size are not suitable for training deep learning models, which require at least tens of thousands of data points, most preliminary studies were forced to oversample inputs from the same patient, making their datasets highly redundant. In this study, by contrast, we were able to exploit the abundance of data and achieved state-of-the-art AUC and generalizability, an improvement over previous studies, one of which reported an AUC of 0.94 with axial T2 MRI [12].
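For readers less familiar with the metric being compared here, the following is a minimal sketch of how AUC can be computed from labels and model scores using the rank-based (Mann-Whitney) definition. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not data from this study.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    sample receives a higher score than a randomly chosen negative one.
    Tied scores count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical values, not study data).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")
```

The pairwise loop is quadratic in the number of samples; for a cohort of thousands of patients a sort-based implementation would be preferable, but the definition is the same.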

Furthermore, unlike previous studies, we experimented with a variety of DL architectures to comprehensively understand the behaviors of different models. In general, they exhibited congruency and a similar level of performance, as represented in Fig. 4a and Fig. 4b, which show representative CCS-negative and CCS-positive cases. However, explainability differed in some samples, highlighting the need for model diversity. Figure 5a shows multilevel CCS at C3-4 and C6-7: the CAMs for ResNet50 and VGG16 are diffusely distributed, while those of MobileNetV3 and EfficientNetV2 each focus on a different affected level. This results in a bimodal heatmap for the ensemble model, illustrating the complementary nature of the models and the importance of ensembling.
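The bimodal ensemble heatmap can be illustrated with a small sketch. It assumes that the ensemble heatmap is the element-wise mean of min-max normalized per-model CAMs, which the text does not confirm, and it reduces each CAM to a one-dimensional activation profile along the cervical levels for readability. All names and activation values below are hypothetical.

```python
def normalize(heatmap):
    """Min-max scale a heatmap to [0, 1]; a flat heatmap maps to zeros."""
    lo, hi = min(heatmap), max(heatmap)
    if hi == lo:
        return [0.0 for _ in heatmap]
    return [(v - lo) / (hi - lo) for v in heatmap]


def ensemble_heatmap(heatmaps):
    """Element-wise mean of normalized heatmaps (assumed ensembling rule)."""
    normalized = [normalize(h) for h in heatmaps]
    return [sum(column) / len(normalized) for column in zip(*normalized)]


def local_peaks(profile, threshold=0.4):
    """Indices of interior local maxima whose value reaches the threshold."""
    return [
        i
        for i in range(1, len(profile) - 1)
        if profile[i] >= threshold
        and profile[i] > profile[i - 1]
        and profile[i] > profile[i + 1]
    ]


LEVELS = ["C2-3", "C3-4", "C4-5", "C5-6", "C6-7", "C7-T1"]

# Hypothetical 1-D activation profiles: one model attends to the upper
# lesion, the other to the lower lesion.
cam_upper_focus = [0.1, 0.9, 0.2, 0.1, 0.2, 0.0]
cam_lower_focus = [0.0, 0.2, 0.1, 0.2, 0.8, 0.1]

combined = ensemble_heatmap([cam_upper_focus, cam_lower_focus])
peak_levels = [LEVELS[i] for i in local_peaks(combined)]
print(peak_levels)
```

Neither single profile has two peaks above the threshold, but their average does, which is the sense in which models that attend to different levels complement each other.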

Figure 5b shows diffuse CCS caused by OPLL affecting C2-4. All of the models successfully localize the level of cord compression, but their CAMs differ slightly. VGG16, MobileNetV3, and the ensemble model produce heatmaps centered at the thecal sac, but interestingly, EfficientNetV2 precisely pinpoints the causative lesion. Thus, a single DL architecture may not be enough to identify all abnormal findings present in the input, and we recommend considering the outputs of each model rather than relying on any one of them.

Figure 5c demonstrates a patient with T2 hyperintensity in C3-6 without definite evidence of cord compression. The output probabilities of the models ranged from 0.20 to 0.59, indicating disagreement among the models. Grad-CAM analyses imply that the signal abnormality of the spinal cord may have raised the outputs of some models. This suggests that DL models are capable of capturing potentially problematic features in the input images.
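One way to surface this kind of disagreement is to report the spread of the per-model probabilities alongside their average. The sketch below assumes soft voting (a plain mean of probabilities) and a decision threshold of 0.5, neither of which is specified in the text. Only the endpoints 0.20 and 0.59 come from the reported range; the intermediate values and the model labels are placeholders.

```python
from statistics import mean


def summarize_outputs(probabilities, threshold=0.5):
    """Summarize per-model probabilities for one patient.

    probabilities: dict mapping a model name to its predicted probability.
    Returns the mean probability, the max-min spread, the per-model votes
    at the given threshold, and whether all models vote the same way.
    """
    values = list(probabilities.values())
    votes = {name: p >= threshold for name, p in probabilities.items()}
    return {
        "ensemble_probability": mean(values),
        "spread": max(values) - min(values),
        "unanimous": len(set(votes.values())) == 1,
        "votes": votes,
    }


# Placeholder outputs: only 0.20 and 0.59 are taken from the text.
example_outputs = {
    "model_a": 0.20,
    "model_b": 0.35,
    "model_c": 0.48,
    "model_d": 0.59,
}
summary = summarize_outputs(example_outputs)
print(summary)
```

A large spread or a non-unanimous vote could be used to flag a case for closer review of the individual CAMs rather than trusting the averaged output alone.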

Figure 6a demonstrates a sample that produced false negative outputs. The models generally do not have difficulty locating the region with the greatest likelihood of CCS. However, confusion may have been caused by the subtle degree of spinal cord indentation and the sufficient CSF space present along the spinal canal. Figure 6b is a test sample that resulted in false positive predictions. Here, the models may have produced high output probabilities based on the findings at C3-5, where severely progressed cervical spondylosis is identified. Thus, it can be inferred that misclassified samples are mostly difficult cases that even experts may find hard to classify from a few mid-sagittal slices alone. Nevertheless, the Grad-CAM analyses indicate that the DL models are highly explainable and can potentially guide clinicians.

Despite the strengths of our DL model, some limitations should be addressed in future work. First, our cohort contains selection bias, since the data were retrospectively collected from patients who had undergone an MRI of the cervical spine at a tertiary hospital. As a result, the prevalence of CCS was 59.0%, much greater than that of the general population, which is less than 0.1% [1, 2]. This may have made the models overly sensitive to positive findings, raising the false positive rate, although this can be considered beneficial in the clinical context because it may help reduce false negative errors. To minimize the possibility of data leakage and further improve generalizability, training and validating the models on a large-scale multi-center cohort is necessary. An external test set established at separate healthcare institutions could provide a more generalizable evaluation of the model; this is left for future work.
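The practical effect of this prevalence gap can be made concrete with Bayes' rule for the positive predictive value (PPV). The sketch below is illustrative only: the sensitivity and specificity of 0.90 are assumed placeholder values, not results from this study, and 0.001 is used as the upper bound of the quoted general-population prevalence.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(disease | positive test), obtained from Bayes' rule."""
    true_positive_mass = sensitivity * prevalence
    false_positive_mass = (1.0 - specificity) * (1.0 - prevalence)
    total_positive_mass = true_positive_mass + false_positive_mass
    if total_positive_mass == 0.0:
        raise ValueError("No positive predictions are possible with these inputs")
    return true_positive_mass / total_positive_mass


# Assumed operating point (hypothetical, not reported in the study).
ASSUMED_SENSITIVITY = 0.90
ASSUMED_SPECIFICITY = 0.90

ppv_cohort = positive_predictive_value(
    ASSUMED_SENSITIVITY, ASSUMED_SPECIFICITY, prevalence=0.59
)
ppv_general = positive_predictive_value(
    ASSUMED_SENSITIVITY, ASSUMED_SPECIFICITY, prevalence=0.001
)
print(f"PPV at cohort prevalence:  {ppv_cohort:.3f}")
print(f"PPV at general prevalence: {ppv_general:.4f}")
```

Under these assumed values, the same classifier yields a PPV above 0.9 in a cohort like ours but below 0.01 at general-population prevalence, which is why evaluation on a cohort with realistic prevalence matters before any screening use.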

Another drawback of our study is the inevitable loss of information caused by selecting only three sagittal images from each patient as input. This choice was made to facilitate transfer learning from pretrained models, which generally take three-channel inputs. However, it may have reduced the sensitivity for CCS-positive cases in which the cause originates from lateral structures not commonly visible in the mid-sagittal plane, for example, paracentral bulging of an intervertebral disc or asymmetric deformity of the spinal canal. This may have resulted in some false negative predictions. Novel architectures that utilize multimodal or multi-channel inputs by incorporating three-dimensional convolutional layers should be thoroughly explored in future work [21,22,23].
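A minimal sketch of this input construction follows. It assumes that the three images are the central consecutive slices of the sagittal series and that they are stacked in the channel dimension; the text does not state the exact selection rule, and the function names are hypothetical.

```python
def mid_sagittal_indices(num_slices, num_channels=3):
    """Indices of the central `num_channels` consecutive slices."""
    if num_slices < num_channels:
        raise ValueError(
            f"Need at least {num_channels} slices, got {num_slices}"
        )
    start = (num_slices - num_channels) // 2
    return list(range(start, start + num_channels))


def stack_as_channels(volume, num_channels=3):
    """Build a channels-first input from a sagittal series.

    volume: list of 2-D slices (each a list of pixel rows), ordered from
    one side of the patient to the other. Every slice outside the central
    window is discarded, which is the information loss discussed above.
    """
    indices = mid_sagittal_indices(len(volume), num_channels)
    return [volume[i] for i in indices]


# Toy series of 11 one-pixel slices, each labeled with its own index.
toy_volume = [[[k]] for k in range(11)]
channels = stack_as_channels(toy_volume)
print(mid_sagittal_indices(len(toy_volume)))
```

In this toy series, eight of the eleven slices never reach the model, so a lesion visible only on parasagittal slices would be absent from the input regardless of the architecture.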

In conclusion, we established a large DCM cohort and successfully developed DL models that predict CCS from mid-sagittal T2-weighted MRI sections. Through extensive analyses, we confirmed that ensembled models result in the best overall performance, but careful examination of the outputs of each model is suggested to effectively assess explainability.
