A Machine Learning Model Based on Global Mammographic Radiomic Features Can Predict Which Normal Mammographic Cases Radiology Trainees Find Most Difficult

Readers

Ethical approval for this research was obtained from the University of Sydney’s Human Research Ethics Committee [protocol no. 2019/013 and 2017/028], and written informed consent was obtained from each reader participating in the study.

A total of 137 Australian and New Zealand RTs who participated in the BREAST (Breastscreen REader Assessment STrategy) programme [28] from 04 September 2014 to 23 February 2021 were included in the study. The mean age of the readers was 33, based on the age details provided by 116 RTs; the remaining 21 did not provide their age. In total, 19% of the readers engaged in the programme through a radiology/BC conference/workshop in a simulated reading room, while 81% completed the programme online through the BREAST platform (https://breast-australia.sydney.edu.au/) at their usual clinical workplace. Reading conditions in both settings were equivalent, with cases presented on two 5-megapixel medical-standard monitors under ambient light levels of 20–40 lx [29]. Most RTs had less than 1 year of experience reading mammograms (59%), read fewer than 20 mammographic cases per week (79%), and spent less than 4 hours per week reading mammograms (96%). Also, 16 readers (12%) had completed a fellowship in breast imaging that lasted for 3 to 6 months and had some experience reading images for a national breast screening programme (Table 1).

Table 1 Characteristics of readers (n = 137)

Mammographic Normal Cases

This retrospective study includes data from readers who completed seven BREAST test sets, each comprising 60 de-identified full-field digital mammographic (FFDM) cases, of which 40 were cancer-free and 20 contained cancer. For the purpose of this investigation, only normal (cancer-free) cases were included. Thus, a total of 280 normal FFDM cases (each comprising four images: two craniocaudal/CC and two mediolateral oblique/MLO views) acquired from screening asymptomatic women aged between 40 and 75 (mean and median = 57, standard deviation = 7, and range = 33 based on the age details of 130 women/cases, while ages of 10 women were not available due to the anonymisation process) were utilised. These cases were collected from various mammography machines, including Fujifilm (Fujifilm Corporation, Minato City, Tokyo, Japan), GE (GE Healthcare, Chicago, IL, USA), Hologic (Hologic, Inc., Marlborough, MA, USA), Philips (Philips Healthcare, Amsterdam, the Netherlands), Sectra (Sectra, Linköping, Sweden), and Siemens (Munich, Germany). The cancer-free status of each normal case was confirmed by at least two independent expert radiologists, each with over 20 years of experience, and validated by subsequent negative screening outcomes.

Categorising Hardest- vs Easiest-to-Interpret Normal Cases

To categorise normal cases as hardest- vs easiest-to-interpret, we first calculated a difficulty score for each of the 280 normal cases from the ratings provided by the 137 RTs using the Royal Australian and New Zealand College of Radiologists (RANZCR) scoring system [28]. A RANZCR score of 1 represents a normal case, while a score of 2 indicates that the RT considered the case benign. Scores of 3, 4, or 5 signify an equivocal or malignant finding, with higher numbers indicating greater confidence that disease is present. Difficulty scores were computed by dividing the number of RTs who misclassified a normal case as cancer (i.e. gave a rating of 3, 4, or 5) by the total number of RTs who read the test set. The 280 normal cases were then classified as hardest- or easiest-to-interpret according to whether their difficulty scores fell in the upper quartile (above the 75th percentile) or the lower quartile (below the 25th percentile), respectively. This resulted in 70 hardest- and 70 easiest-to-interpret normal cases, totalling 140 normal cases (560 DICOM images). To ensure a clear distinction between the most and least difficult normal cases, and thereby enable optimal feature learning for our machine learning model, only the images of cases in the upper and lower quartiles were used for further analysis. Together, these two categories comprised 59 low-density (35 hardest- and 24 easiest-to-interpret) and 81 high-density normal cases (35 hardest- and 46 easiest-to-interpret).
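The scoring and quartile split described above can be illustrated with a minimal Python sketch. It is not the authors' code: the case identifiers and ratings are hypothetical, and the percentile interpolation (`method="inclusive"`) and the handling of scores that fall exactly on a cut-off are assumptions, since the text does not specify them.

```python
from statistics import quantiles


def difficulty_score(ratings):
    """Fraction of readers who rated a normal case 3, 4, or 5 (a false positive)."""
    if not ratings:
        raise ValueError("case has no ratings")
    return sum(1 for r in ratings if r >= 3) / len(ratings)


def split_by_quartiles(scores):
    """Return (hardest, easiest) case ids using the 75th and 25th percentiles.

    Assumption: cases exactly on a cut-off are kept in the extreme group.
    """
    q1, _, q3 = quantiles(list(scores.values()), n=4, method="inclusive")
    hardest = sorted(c for c, s in scores.items() if s >= q3)
    easiest = sorted(c for c, s in scores.items() if s <= q1)
    return hardest, easiest


# Hypothetical RANZCR ratings (1-5) from four readers for eight normal cases.
ratings_by_case = {
    "case_a": [1, 1, 2, 1],
    "case_b": [1, 2, 2, 1],
    "case_c": [1, 3, 1, 2],
    "case_d": [4, 1, 1, 1],
    "case_e": [3, 3, 1, 2],
    "case_f": [1, 4, 5, 2],
    "case_g": [3, 4, 3, 1],
    "case_h": [3, 5, 4, 3],
}

scores = {case: difficulty_score(r) for case, r in ratings_by_case.items()}
hardest, easiest = split_by_quartiles(scores)
```

Cases between the two cut-offs are simply discarded, which mirrors the decision to analyse only the upper and lower quartiles.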

Global Radiomic Feature/Score Per Case

To obtain a global radiomic feature/score for each of the 140 normal cases, we used the 34 quantitative GMRFs of the 560 images belonging to these cases, extracted in the previous study [11]. Details of the GMRF extraction method can be found in that study [11]. Briefly, from the 560 DICOM images, we first generated 560 binary masks using a standardised gray level threshold value of 100 (with manual adjustment of the value where needed) to separate the breast region from its background while eliminating unwanted artefacts and labels. We then converted the DICOM images and masks to TIFF format, flipped all the right CC and MLO images to the left so that the chest wall was consistently on the left side of all images, and cropped the TIFF images and masks based on the maximum size of the breast region. These cropped TIFF images and masks were used as input for the radiomic analysis (Fig. 1). Next, we outlined the region of interest on the images using the lattice-based/ROI (multiple regions of interest covering the entire breast image) [30] and squared-based/SQ (largest square box inscribed within the breast) [31] approaches. The 34 handcrafted GMRFs/scores per image were then extracted from the delineated region of interest using our in-house MATLAB platforms and normalised using a common z-score normalisation method [32], which sets the image intensity mean to zero and the standard deviation to one. Features using the ROI approach [30] were analysed using the MATLAB distinct block processing technique (block size 214 × 214 pixels) and summarised using the standard deviation method. The extracted GMRFs of the 4 images (i.e. 2 CC and 2 MLO images) belonging to a case were then averaged to obtain one global radiomic feature/score per case.
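The final two steps (z-score normalisation and per-case averaging) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' MATLAB pipeline: it assumes each feature is standardised across all images using the sample standard deviation, and the feature values and case identifiers are hypothetical.

```python
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardise each feature column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [
        [(v - mu) / sd if sd > 0 else 0.0 for v, mu, sd in zip(row, mus, sds)]
        for row in rows
    ]


def average_per_case(case_ids, rows):
    """Average the feature vectors of all images that belong to the same case."""
    grouped = {}
    for case_id, row in zip(case_ids, rows):
        grouped.setdefault(case_id, []).append(row)
    return {
        case_id: [mean(col) for col in zip(*images)]
        for case_id, images in grouped.items()
    }


# Hypothetical data: two cases, four images each (2 CC + 2 MLO), two features.
image_case_ids = ["case_1"] * 4 + ["case_2"] * 4
raw_features = [
    [1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [4.0, 16.0],
    [5.0, 9.0], [6.0, 11.0], [7.0, 13.0], [8.0, 15.0],
]

standardised = zscore_columns(raw_features)
per_case = average_per_case(image_case_ids, standardised)
```

With 34 features per image, each row would simply have 34 entries and each case would end up with one 34-element vector.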

Fig. 1

Radiomics workflow. 1. Input mammographic images and masks were acquired. 2. Region of interest (yellow colour region) was delineated using lattice- and squared-based approaches. 3. A set of 34 global mammographic radiomic features (GMRFs)/scores per image were extracted from the yellow region of interest, then standardised to have image intensity mean equal to zero and standard deviation equal to one. The extracted GMRFs of four images belonging to each of the 140 cases were then averaged to obtain one GMRF per case. 4. Finally, using the averaged GMRFs, a random forest machine learning model for differentiating radiology trainees’ hardest- from easiest-to-interpret normal cases was constructed and assessed

The 34 GMRFs consisted of 30 gray level co-occurrence matrix/GLCM-based Haralick texture features [33], 2 neighbourhood gray tone difference matrix/NGTDM-based texture features [34], and 2 first-order statistics/FOS-based features [25] (Table 2 and Fig. 1). These features were chosen because they are useful for describing mammographic appearance: GLCM and NGTDM features measure contrast in the spatial inter-relationships between neighbouring pixels, while FOS features describe the distribution of single-pixel intensity values within the image region of interest [25, 33].
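To make the GLCM idea concrete, the following Python sketch builds a normalised co-occurrence matrix for one pixel offset and computes a single Haralick feature (contrast). It is illustrative only: the offset, the symmetric counting, the number of gray levels, and the tiny 4 × 4 image are assumptions and do not reproduce the settings used to extract the 34 GMRFs.

```python
def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Normalised gray level co-occurrence matrix for the pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    n_rows, n_cols = len(image), len(image[0])
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    counts[j][i] += 1  # count the pair in both directions
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def haralick_contrast(p):
    """Contrast: co-occurrence probabilities weighted by squared gray level difference."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


# Small 4 x 4 example image quantised to four gray levels (0-3).
example_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

p_horizontal = glcm(example_image, levels=4)
contrast_value = haralick_contrast(p_horizontal)
```

A perfectly uniform region gives a contrast of zero, while regions with many abrupt transitions between neighbouring pixels give larger values.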

Table 2 Extracted global mammographic radiomic features (n = 34)

Model Building

For the task of differentiating between normal cases that are hardest- and easiest-to-interpret for RTs, a random forest machine learning model was built by feeding the 34 averaged GMRFs into a MATLAB ensemble of decision trees boosted with the adaptive logistic regression method (i.e. LogitBoost). The random forest with boosting approach was chosen for its ability to produce an explainable model with automatic estimation of feature importance, and for its built-in feature selection, which can minimise feature overfitting and potential selection bias problems [35]. Important GMRFs were identified by analysing feature importance scores from MATLAB’s predictor importance algorithm, which also helps in identifying and mitigating redundant and biased features. Feature importance scores indicate how influential each feature was when building the predictive model, with larger values meaning that the feature had a larger effect on the model’s discrimination of hardest- from easiest-to-interpret normal cases of RTs (Fig. 1).
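For readers unfamiliar with LogitBoost, the Python sketch below shows the two-class algorithm with one-split regression trees (stumps) as weak learners, and accumulates each feature's weighted squared-error reduction as a simple importance score. It is a teaching sketch under several assumptions, not the MATLAB ensemble used in the study: the weak learner depth, the number of rounds, the clipping constants, the importance definition, and the toy data are all hypothetical.

```python
import math


def fit_stump(X, z, w):
    """Weighted least-squares stump: (feature, threshold, left, right, gain) or None."""
    total_w = sum(w)
    total_wz = sum(wi * zi for wi, zi in zip(w, z))
    base = total_wz ** 2 / total_w
    best = None
    for f in range(len(X[0])):
        order = sorted(range(len(X)), key=lambda i: X[i][f])
        wl = wzl = 0.0
        for k in range(len(order) - 1):
            i = order[k]
            wl += w[i]
            wzl += w[i] * z[i]
            lo, hi = X[order[k]][f], X[order[k + 1]][f]
            if lo == hi:
                continue  # cannot split between identical values
            wr, wzr = total_w - wl, total_wz - wzl
            gain = wzl ** 2 / wl + wzr ** 2 / wr - base  # reduction in weighted SSE
            if best is None or gain > best[4]:
                best = (f, (lo + hi) / 2, wzl / wl, wzr / wr, gain)
    return best


def logitboost(X, y, rounds=20):
    """Two-class LogitBoost; y holds 0/1 labels. Returns (stumps, importance)."""
    F = [0.0] * len(X)
    stumps, importance = [], [0.0] * len(X[0])
    for _ in range(rounds):
        p = [1.0 / (1.0 + math.exp(-2.0 * Fi)) for Fi in F]
        w = [max(pi * (1.0 - pi), 1e-10) for pi in p]
        # Working response, clipped for numerical stability (an assumption).
        z = [max(-4.0, min(4.0, (yi - pi) / wi)) for yi, pi, wi in zip(y, p, w)]
        stump = fit_stump(X, z, w)
        if stump is None:
            break
        f, thr, left, right, gain = stump
        stumps.append(stump)
        importance[f] += gain
        F = [Fi + 0.5 * (left if x[f] <= thr else right) for Fi, x in zip(F, X)]
    return stumps, importance


def predict_score(stumps, x):
    """Additive score F(x); positive values favour class 1."""
    return sum(0.5 * (left if x[f] <= thr else right) for f, thr, left, right, _ in stumps)


# Toy data: feature 0 separates the classes, feature 1 is uninformative.
X_toy = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
         [0.6, 3.0], [0.7, 5.0], [0.8, 1.0], [0.9, 4.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

stumps, importance = logitboost(X_toy, y_toy, rounds=10)
```

In this toy example the informative feature collects all of the importance, which is the behaviour that makes importance scores useful for ranking GMRFs.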

Statistical Analysis and Validation

To examine the performance of the model on our data while maximising the use of the available data for both training and validation, we trained and validated the model using leave-one-out cross-validation (LOOCV), a resampling approach regarded as an unbiased, dependable, and accurate method for assessing a machine learning model’s generalisation performance [36]. In each iteration, the model was trained on all cases except one, which was left out and used as a test set to validate the model’s predictive performance. We repeated this process 140 times (once per normal case) so that each case was left out and used exactly once as a test set. The model’s performance was then evaluated based on how accurately it predicted the left-out, unseen case in each repetition, providing an effective estimate of how well the model generalises to unseen cases. The overall performance of the model for discriminating hardest- from easiest-to-interpret normal cases of RTs was assessed using the AUC.
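The LOOCV loop and the AUC computation can be sketched in Python as below. To keep the example short, a nearest-centroid scorer stands in for the boosted tree ensemble; that substitution, the function names, and the one-feature toy data are assumptions made purely for illustration.

```python
def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loocv_scores(X, y, fit, score):
    """Train on all cases but one, score the held-out case, repeat for every case."""
    held_out_scores = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        held_out_scores.append(score(model, X[i]))
    return held_out_scores


def fit_centroids(X, y):
    """Stand-in model (not the study's classifier): one centroid per class."""
    def centroid(label):
        rows = [x for x, lab in zip(X, y) if lab == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    return centroid(0), centroid(1)


def centroid_score(model, x):
    """Larger values mean the case lies closer to the class-1 centroid."""
    c0, c1 = model
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return d0 - d1


# Toy data with one overlapping pair (0.6 is class 0, 0.4 is class 1).
X_cv = [[0.1], [0.2], [0.3], [0.6], [0.4], [0.7], [0.8], [0.9]]
y_cv = [0, 0, 0, 0, 1, 1, 1, 1]

cv_scores = loocv_scores(X_cv, y_cv, fit_centroids, centroid_score)
cv_auc = auc(y_cv, cv_scores)
```

Because every score comes from a model that never saw the corresponding case, the resulting AUC reflects performance on unseen data rather than training fit.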

A Kruskal–Wallis test was employed to investigate whether the 34 GMRFs differed between hardest- and easiest-to-interpret normal cases of RTs, and whether difficulty level differed between low- and high-density normal cases of RTs. A p-value of less than 0.05 was considered statistically significant.
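For a two-group comparison such as this, the Kruskal–Wallis statistic and its p-value can be computed as in the Python sketch below. It is a minimal illustration, assuming midranks for ties, the usual tie correction, and the chi-square approximation with one degree of freedom; the feature values are hypothetical.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for two groups using the chi-square approximation (df = 1)."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = midranks(pooled)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12.0 / (n * (n + 1)) * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)) - 3 * (n + 1)
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    correction = 1.0 - sum(t ** 3 - t for t in tie_counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h = max(h / correction, 0.0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(h / 2.0))
    return h, p


# Hypothetical values of one feature in easiest and hardest cases.
easiest_values = [1.0, 2.0, 3.0, 4.0, 5.0]
hardest_values = [6.0, 7.0, 8.0, 9.0, 10.0]
h_stat, p_value = kruskal_wallis_two_groups(easiest_values, hardest_values)
significant = p_value < 0.05
```

Running the test once per feature yields 34 p-values, each compared against the 0.05 threshold stated above.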

Moreover, a scree test of exploratory factor analysis [37] was used to determine the usefulness of the GMRFs based on the sum of their importance scores from the model.

Radiomics and statistical analysis were performed using MATLAB R2022a (MathWorks, Natick, MA, USA).
