Artificial intelligence-based, semi-automated segmentation for the extraction of ultrasound-derived radiomics features in breast cancer: a prospective multicenter study

The radiomic workflow proposed here extracts interpretable biomarkers and enables the development of predictive models that, by exploiting shallow learning methods, are better suited to the analysis of tabular data. Together, radiomic biomarkers and shallow learning methods allow models that are both high-performing and interpretable.

Although promising, ultrasound radiomics technologies for breast cancer assessment are currently under scrutiny, with reported AUC values for the differentiation between benign and malignant breast masses ranging from 0.817 to 0.961 [14].

Lesion segmentation is the first and, in many respects, the most crucial step of radiomic feature extraction. On the one hand, careful recognition of the breast lesion and precise contouring are deemed mandatory. On the other hand, the segmentation task is performed by a dedicated operator, often a physician; it is time-consuming and operator-dependent, and it introduces a possible source of error [12]. In our study, we tried to overcome these issues by using AI software that has previously been shown to improve the diagnostic performance of radiologists as well as intra- and inter-reader agreement in the characterization of FBLs [39]. This AI algorithm performs the lesion segmentation instantly, thus eliminating the considerable time usually required to complete this task, while leaving to the operator the final decision to accept the proposed segmentation.

In our experience, the semi-automated lesion segmentation process was highly reproducible, with a high acceptance rate for both operators (91.3% and 93.3%) and no statistically significant difference in terms of discordance (p = 0.515). Interestingly, the analysis of rejected segmentations showed a prevalence of malignant lesions characterized by posterior shadowing at B-mode US. Notably, a study encompassing 367 patients showed that the peripheral tissue around a breast lesion influences the subsequent classification based on B-mode images: paradoxically, when assessing US images of breast cancer by means of radiomics, a perfect separation between the lesion and the surrounding breast parenchyma might not be the best option [40].

In our series, the stand-alone radiomics model showed an AUC of 0.644. This finding is in line with a study evaluating four qualitative ultrasound features (regular tumor shape, no angular or spiculated margin, posterior acoustic enhancement and no calcification) in 252 breast lesions to predict the histological grade, Ki67, HER2, axillary lymph node metastasis and lymphovascular invasion, with corresponding AUC values of 0.673, 0.680, 0.651, 0.587 and 0.566, respectively [41]. Similarly, we did not include in our radiomics model any color Doppler or US elastography information, which might have improved the diagnostic performance, as reported in [42]. Furthermore, we implemented a machine learning model, whereas deep learning algorithms have been shown to be more proficient, with a reported AUC of 0.955 in a study encompassing 10,815 and 912 multimodal and multiplane breast US images for training and prospective testing, respectively [43]. Notably, in a prospective study including 124 women, quantitative multiparametric multimodal US provided the best diagnostic performance for breast cancer diagnosis [44].

In our validation cohort, when adding radiomic features to the US BI-RADS classification made by the radiologist, we obtained an AUC of 0.925, which is within the range of previously reported data. Of note, we observed a non-significant increase in specificity (0.929 vs. 0.9, p = 0.540) at the expense of a statistically significant decrease in sensitivity (0.756 vs. 0.933, p = 0.021). In particular, the combined model correctly identified two more benign lesions (65 vs. 63, respectively) but missed eight more breast cancers (34 vs. 42 detected, respectively), with a corresponding decrease in detection rate from 93.3% to 75.6%. The observed reduction in the false positive rate and the consequent decrease in the number of unnecessary biopsies, although not statistically significant, are in line with the study by Shen et al., who presented an AI system that achieved an AUROC of 0.976 on a test set consisting of 44,755 exams, as well as a 37.3% decrease in false positive rates and a 27.8% reduction in requested biopsies, while maintaining the same level of sensitivity [45]. Unlike that study, in our prospective study the impact of missed cancers due to the increased false negative rate outweighed the decrease in unnecessary biopsies allowed by the reduced false positive rate. However, the retrospective nature of that study prompted its authors to ask for a prospective validation before the system can be widely deployed in clinical practice.
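The trade-off described above can be checked with a short calculation. The sketch below is only illustrative: the detected-cancer and correctly classified benign counts come from the text, whereas the cohort totals (45 malignant and 70 benign lesions) are not stated in this section and are an assumption inferred from the reported rates (42/0.933 ≈ 45 and 63/0.9 = 70). The names `summarize`, `radiologist` and `combined` are hypothetical.

```python
# Assumed totals, inferred from the reported rates rather than stated in the text.
N_MALIGNANT = 45
N_BENIGN = 70

# Counts reported in the text: cancers detected (tp) and benign lesions
# correctly classified (tn) by each reading strategy.
radiologist = {"tp": 42, "tn": 63}   # US BI-RADS alone
combined = {"tp": 34, "tn": 65}      # US BI-RADS plus radiomic features


def summarize(counts):
    """Derive sensitivity, specificity and error counts from tp/tn."""
    return {
        "sensitivity": counts["tp"] / N_MALIGNANT,
        "specificity": counts["tn"] / N_BENIGN,
        "missed_cancers": N_MALIGNANT - counts["tp"],
        "false_positives": N_BENIGN - counts["tn"],
    }


for name, counts in (("radiologist", radiologist), ("combined", combined)):
    s = summarize(counts)
    print(
        f"{name}: sensitivity={s['sensitivity']:.3f}, "
        f"specificity={s['specificity']:.3f}, "
        f"missed cancers={s['missed_cancers']}, "
        f"false positives={s['false_positives']}"
    )
```

Under these assumed totals, the combined model would avoid two false positives at the cost of eight additional missed cancers, which is the imbalance discussed in the paragraph above.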

In a recent retrospective study encompassing 201 FBLs, an ML classifier showed higher accuracy than an expert radiologist (82% vs. 79.4%) in the differential diagnosis of benign and malignant breast masses, but the difference was not statistically significant (p = 0.815) [46]. Of note, in that study the radiologist was able to read only one defined US image per lesion, being unaware of the patient's clinical history and, most notably, being unable to perform a complete US scan. By contrast, in our study the reading radiologist performed a proper US scan, whereas the radiomics model was based on still images, which may not fully depict the US features of a breast mass. Furthermore, the reading radiologist made the diagnosis on the basis of all the available data, thus more closely resembling real clinical practice. These circumstances may at least partially explain the higher sensitivity of the expert radiologist observed in our series in comparison with the radiomics model. The same limitations affect a study focusing on the automatic classification of ultrasound breast lesions using a deep convolutional neural network (dCNN) involving 582 patients [18]. In that study, the performance of the dCNN was found to be comparable to that of two radiologists with more than 5 years of experience (AUC of 83.8 for the dCNN vs. 86.9 and 82.3 for the humans on the internal dataset, respectively), but the authors stated that the final decision should always be left to the radiologists. This statement is supported by our results, which reflect the real clinical practice of highly trained and expert breast radiologists.

Notably, a multicenter study by Gu et al., encompassing US images of 5012 patients, found a statistically significant decrease in sensitivity in the external test cohort for a binary deep learning model in comparison with the original radiologists (89.81% vs. 99.30%, p = 0.0020) [47]. However, in the same study a six-category BI-RADS-based deep learning model achieved sensitivity values not statistically different from those of the reading radiologists (95.37% vs. 99.30%, p = 0.1250). A binary assessment was also used in our study and might have negatively affected the results of our radiomic model.

From a value-based healthcare perspective, AI may provide useful data-driven tools to empower multimodality breast imaging in the setting of screening, lesion characterization, therapy guidance and monitoring assessment [48]. Nevertheless, a translational gap still exists. In the near future, AI-based systems in mammography and ultrasound are not expected to replace radiologists, but rather to offer support in the decision-making process or to reduce the radiologist's workload [49, 50]. To this end, multiparametric MRI may provide information on pathophysiological tumor characteristics, useful for imaging biomarker research aimed at improving the prediction of treatment response, disease-free survival, molecular subtype and lymph node status [51,52,53]. However, multiparametric MRI and other imaging modalities still need further efforts to fully exploit AI and radiomics in order to better individualize breast cancer treatment in the era of precision medicine [48].

In our series, the most important radiomic features were Busyness, LargeAreaHighGrayLevelEmphasis, Kurtosis and maximum probability. Busyness (belonging to the NGTDM category) is a measure of the change from a pixel to its neighbor. A high value for busyness indicates a 'busy' image, with rapid changes of intensity between a pixel and its neighborhood. Analysis of the two distributions (i.e., malignant, benign) shows that malignant cases are associated with higher busyness values than benign cases. This reflects the fact that malignant lesions generally exhibit a more heterogeneous pattern, where areas with different densities (e.g., habitats) are present within the lesion. LargeAreaHighGrayLevelEmphasis (belonging to the GLSZM category) measures the proportion in the image of the joint distribution of larger size zones with higher gray-level values. In malignant cases, it is more likely to have regions of higher density (i.e., areas with high gray values). Kurtosis (belonging to the FO category) is a measure of the 'peakedness' of the distribution of values in the image ROI. A higher kurtosis implies that the mass of the distribution is concentrated toward the tail(s) rather than toward the mean. A lower kurtosis implies the reverse: that the mass of the distribution is concentrated toward a spike near the mean value. This trend is consistent with the clinical literature: benign lesions generally have a more oval (roundish) shape than malignant lesions, which have a wider variability in shape. This is reflected in the fact that malignant masses have greater kurtosis than benign ones [54, 55]. Finally, maximum probability (belonging to the GLCM category) calculates the occurrences of the most predominant pair of neighboring intensity values, thus revealing the existence of a regular pattern within the lesion.
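To make two of these definitions concrete, the following is a minimal sketch in plain NumPy of first-order kurtosis and GLCM maximum probability. It is not the extraction pipeline used in the study: the extraction library and its settings are not given in this section, so the function names, the single horizontal offset, the symmetric co-occurrence matrix and the non-excess kurtosis convention (a normal distribution scores 3) are all assumptions made for illustration.

```python
import numpy as np


def kurtosis(values):
    """Non-excess kurtosis: fourth central moment over squared variance.

    Assumes the values are not all identical (variance must be non-zero).
    """
    x = np.asarray(values, dtype=float).ravel()
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return m4 / m2 ** 2


def glcm_maximum_probability(image, levels, offset=(0, 1)):
    """Largest entry of a normalized, symmetric gray-level co-occurrence matrix.

    `image` holds integer gray levels in [0, levels); `offset` is the
    (row, column) displacement between the two pixels of each pair.
    """
    img = np.asarray(image, dtype=int)
    rows, cols = img.shape
    dr, dc = offset
    glcm = np.zeros((levels, levels), dtype=float)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = img[r, c], img[r2, c2]
                glcm[i, j] += 1  # pair counted in one direction
                glcm[j, i] += 1  # and in the opposite one (symmetric GLCM)
    return (glcm / glcm.sum()).max()


# Hypothetical toy patches, not study data.
uniform_patch = np.zeros((2, 2), dtype=int)
checker_patch = np.array([[0, 1], [1, 0]])
print(glcm_maximum_probability(uniform_patch, levels=2))
print(glcm_maximum_probability(checker_patch, levels=2))
print(kurtosis([1, 2, 3, 4, 5]))
```

In this toy setting a perfectly uniform patch concentrates all pairs in a single GLCM cell, so its maximum probability is as high as it can be, whereas the checkerboard patch splits its pairs across two cells and scores lower.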

Our study has limitations. Firstly, the prevalence of breast cancer in our study population is higher than that of the general population, leading to a selection bias. A balanced dataset (i.e., with comparable numbers of benign and malignant lesions) is nonetheless needed to obtain a predictive model capable of accurately classifying both classes, because an unbalanced dataset leads an ML classifier to classify the most represented class better [35]. Furthermore, because the participating centers are secondary referral centers for breast cancer, our population might present a higher number of malignancies in comparison with the general population.

Secondly, our dataset consisted of 352 patients, but a larger sample might have granted a more robust ML assessment. Thirdly, we did not include color Doppler or elastosonography information in our radiomics model, nor clinical information: these data might have led to different values of diagnostic accuracy for the ML model. Further studies are needed to address this issue. Furthermore, although we ensured the explainability of our machine learning system, a deep learning approach might be a future option for building more proficient radiomics models.

Classifiers based on shallow learning algorithms are an optimal starting point for the clinical domain since they can be easily interpreted and explained; unfortunately, they cannot match the performance of deep models, which, however, do not guarantee explainability. In future work, we intend to adopt more advanced and robust techniques to bring explainability to deep learning models.

In conclusion, our study showed that artificial intelligence-based semi-automated lesion segmentation makes the extraction of ultrasound-derived radiomics features instantaneous while maintaining high reproducibility. Posterior acoustic shadowing was the most important feature determining the rejection of the semi-automated segmentation. In our experience, the combination of radiomics and US BI-RADS classification led to a potential decrease in unnecessary biopsies. Nevertheless, in our series, a non-negligible increase in potentially missed cancers was observed. To avoid such a negative effect, AI-based systems must reach adequate diagnostic performance so that they can be applied to wide populations [56]. Consequently, larger international multicenter studies are needed before implementing radiomics features in the routine US assessment of breast cancer.

RADIOLOGIA MEDICA
