This study was conducted in accordance with the Declaration of Helsinki and national regulations. The study was approved by the University Hospital Centre of São João, Porto, Portugal, including approval by the institutional Ethics Committee and by the officer responsible for data reuse. The requirement for informed consent was waived owing to the retrospective nature of the research.
Image Dataset
A PET image dataset of pulmonary nodules was created. To ensure the quality of the data for modelling, the eligible population, the reference standard and the sampling procedure were first determined. The data were then collected and preprocessed.
Eligible Population
Participants belonged to the eligible population if they cumulatively met the following inclusion criteria:
One or more indeterminate solid pulmonary nodules with an average diameter greater than 8 mm. The average diameter should not exceed 30 mm, according to the nodule definition provided by the British Thoracic Society guidelines [8]. The average diameter of the nodule corresponds to the mean of the long-axis and perpendicular short-axis diameters, both obtained on the same orthogonal slice, as defined in the Fleischner Society guidelines [7];
The nodule was detected incidentally or through screening;
2-[18F]FDG PET/CT was performed for clarification of the nodule(s), and the reconstructed images were available in digital format. The pathological status of the nodule(s) was unknown at the time of the PET/CT (indeterminate nodule);
The nodule was biopsied or excised and underwent histopathological or cytopathological examination, or otherwise completed an imaging follow-up period.
Participants meeting at least one of the following criteria were excluded:
History of lung cancer;
History of other cancers, except:
Non-melanoma skin cancer, low-risk localised prostate cancer, in situ cervical cancer, in situ breast cancer, or superficial bladder cancer treated at least 6 months earlier.
Reference Standard
The reference standard for the pulmonary nodule status was defined on the basis of the histopathological or cytopathological examination, and/or the nodule behaviour during the follow-up period with CT. It attributes one of two classes (benign or malignant) to the target feature, the status of each pulmonary nodule, as follows:
1. A nodule was defined as malignant if it was biopsied or excised during the initial diagnostic workup or during the follow-up period and the histopathological or cytopathological examination showed a malignant neoplasm.
2. A nodule was defined as benign if:
a. It was excised and the histopathological examination showed a benign pathology;
b. It was biopsied, the biopsy was diagnostic and the histopathological examination showed a benign pathology;
c. It was neither excised nor biopsied, or the biopsy was non-diagnostic, and during follow-up:
i. The nodule disappeared;
ii. The nodule decreased in size or remained the same size for at least 2 years of follow-up;
iii. The nodule increased in size, was thereafter biopsied or excised, and the histology was benign;
iv. The volume doubling time was >600 days with <25% change in volume over at least 1 year of follow-up.
A minimum of 2 years of imaging follow-up was required for solid nodules when the mean axial diameter of the nodule was used for follow-up. When the follow-up period was between 1 and 2 years, the nodular volume was estimated from the diameters on three orthogonal axes. These follow-up criteria are based on the doubling time of malignant solid nodules and are recommended for pulmonary nodule management [7, 8].
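To make the volume-based criterion concrete, the sketch below (a minimal Python example; the function names and the ellipsoid approximation from three orthogonal diameters are illustrative conventions, not taken from the study protocol) computes a nodule volume and its doubling time between two scans:

```python
import math

def ellipsoid_volume(d1_mm, d2_mm, d3_mm):
    """Nodule volume (mm^3) estimated from three orthogonal diameters."""
    return math.pi / 6.0 * d1_mm * d2_mm * d3_mm

def volume_doubling_time(v1_mm3, v2_mm3, interval_days):
    """Doubling time in days, assuming exponential growth between two scans."""
    return interval_days * math.log(2) / math.log(v2_mm3 / v1_mm3)

# Example: <25% volume growth over 1 year implies a doubling time well above 600 days
v1 = ellipsoid_volume(10, 9, 9)
v2 = 1.20 * v1                               # 20% volume increase
print(volume_doubling_time(v1, v2, 365))     # ~1388 days -> benign behaviour
```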
Sampling
Every patient referred to the University Hospital Centre of São João who underwent a 2-[18F]FDG PET/CT scan between 2010 and 2019 was consecutively selected if he/she belonged to the defined population.
If a patient underwent more than one PET/CT scan, only the first one was considered. If a patient had more than one nodule meeting the eligibility criteria, only the most suspicious one was included.
Among the 7130 PET/CT scan requests within the established time interval, the 2-[18F]FDG PET/CT scans aimed at clarifying the diagnosis of pulmonary nodules were selected. The eligibility criteria were then checked by consulting the medical records and the information from the histopathological/cytopathological examination, the standard-dose CT scan and the 2-[18F]FDG PET/CT scan. In the end, 113 participants were eligible for the PET image dataset.
Image Acquisition, Preprocessing, and Annotation
All patients underwent a PET/CT scan with a field of view from the skull base to the mid-thighs around 60 min after the intravenous injection of 2-[18F]FDG. The exams were acquired on three different scanners (GE Discovery IQ 4R, GE Discovery LS/4 and Siemens Biograph 6). The PET images were reconstructed using the ordered subset expectation maximisation method. Attenuation correction of the PET data was performed with low-dose CT-derived attenuation maps.
Image preprocessing was performed in 3D Slicer 4.10.2 r28257 [39]. Both the PET and CT image files were imported and coregistered with rigid registration. Since the PET/CT scans were performed on different scanners, the PET volumes had different voxel sizes and anisotropic spacing. Therefore, the volumes were resampled to obtain a common voxel size and isotropic voxels. The voxel side was set to 1.5 mm, which is smaller than the smallest voxel side of the three scanners. Linear interpolation was used for spatial resampling.
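For illustration, an equivalent resampling step can be written with SimpleITK (an assumed tooling choice made for this sketch; the study performed this step in 3D Slicer). The sketch resamples a volume to 1.5 mm isotropic voxels with linear interpolation while preserving its physical extent:

```python
import SimpleITK as sitk

def resample_isotropic(image: sitk.Image, voxel_mm: float = 1.5) -> sitk.Image:
    """Resample a volume to isotropic voxels with linear interpolation."""
    old_spacing = image.GetSpacing()
    old_size = image.GetSize()
    new_spacing = (voxel_mm,) * 3
    # Keep the physical extent of the volume constant
    new_size = [int(round(sz * sp / voxel_mm))
                for sz, sp in zip(old_size, old_spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(),
                         sitk.sitkLinear, image.GetOrigin(), new_spacing,
                         image.GetDirection(), 0.0, image.GetPixelID())

pet = resample_isotropic(sitk.ReadImage("pet.nrrd"))  # hypothetical file name
```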
The nodule was visually identified in the coregistered PET/CT images, and a cubic region of interest was drawn and cropped to include the entire nodule. The centre of this subvolume coincides with the centre of the nodule, and its side length equals twice the maximum possible nodule diameter (60 mm × 60 mm × 60 mm). The cropped subvolume was saved in .nrrd format, and each subvolume containing a pulmonary nodule was annotated with the corresponding class of the target feature (benign or malignant).
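At 1.5 mm isotropic spacing, a 60 mm cube corresponds to the 40 × 40 × 40 input tensor described below. A minimal NumPy sketch of the centred crop (names are illustrative; handling of nodules near the volume border is omitted):

```python
import numpy as np

def crop_cube(volume: np.ndarray, centre_ijk, side_vox: int = 40) -> np.ndarray:
    """Crop a cubic subvolume of side `side_vox` voxels centred on the nodule."""
    half = side_vox // 2
    slices = tuple(slice(c - half, c - half + side_vox) for c in centre_ijk)
    return volume[slices]

# e.g. a nodule centred at voxel (120, 85, 60) of the resampled PET volume:
# cube = crop_cube(pet_array, (120, 85, 60))  # -> shape (40, 40, 40)
```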
Formulation of the Deep Learning Task
The supervised deep learning problem is a single-task, single-label, binary classification problem whose inputs are cubic PET regions of interest for a three-dimensional (3D) CNN.
Let X be a random variable representing an input, i.e. a PET image, which corresponds to a tensor whose axes 1, 2 and 3 define the shape of the volume of interest (40 × 40 × 40) and whose axis 4 is the number of channels, in this case one. Let Y be a random variable corresponding to the target. Let S be a training set with n pairs (X, Y) of independent and identically distributed samples drawn from the population. The learning problem then consists of using a CNN-based algorithm to choose, from the hypothesis space, the hypothesis or model that best approximates an unknown mapping function f: X→Y in the population, using the training set as a starting point. Model training is performed by finding the parameter configuration that minimises a loss function on the training set, the structural risk, a surrogate of the expected risk [40, 41]. However, minimising the risk on the training set is prone to overfitting, and a dissociation between the expected and structural risks can occur at any time during training [42]. An estimate of the expected risk on the validation set is more accurate, but cannot be used to update model parameters, so it may instead be used to decide when to stop training [42].
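In standard empirical risk minimisation notation (a generic textbook formulation consistent with [40, 41], not a verbatim reproduction of those sources), the chosen model is

```latex
\hat{h} \;=\; \operatorname*{arg\,min}_{h \in \mathcal{H}}\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\bigl(h(x_i),\, y_i\bigr)}_{\text{empirical risk on } S}
\;+\; \lambda\,\Omega(h)
```

where ℓ is the binary cross-entropy loss, Ω a regulariser (here, L2 weight decay) and λ its strength, while the quantity one actually wishes to minimise is the expected risk E[ℓ(h(X), Y)] over the population.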
Experimental Setup
Input Data Splitting
The dataset was randomly split into five stratified partitions of similar size. The stratification was performed by the target class in order to maintain, in each partition, the class distribution of the original data. Four partitions were used for 4-fold cross-validation, and the fifth was reserved for testing. In each fold of cross-validation, three of the four partitions were used for training and the remaining one for model evaluation. Four-fold cross-validation was therefore used for training, evaluating, hyperparameter tuning and comparing models built from different network architectures and, in the end, for choosing the best model. Cross-validation was preferred because it guarantees lower variance than the holdout method for a dataset of this size [43].
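A minimal sketch of this splitting scheme with scikit-learn (an assumed tooling choice made for illustration; the study itself ran in R) follows. StratifiedKFold with five splits yields five stratified partitions of similar size:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# X: array of PET cubes, shape (n, 40, 40, 40, 1); y: binary labels, shape (n,)
five_way = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
parts = [idx for _, idx in five_way.split(np.zeros(len(y)), y)]
test_idx, cv_parts = parts[-1], parts[:-1]   # fifth partition held out for testing

# 4-fold cross-validation over the four remaining partitions
for k in range(4):
    val_idx = cv_parts[k]
    train_idx = np.concatenate([p for j, p in enumerate(cv_parts) if j != k])
    # ... train on train_idx, evaluate on val_idx
```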
Since tuning a model is a repetitive process, there is some leakage of information from the validation partition into the model, even though the model is not directly trained on it, resulting in overfitting to the validation set and optimistic performance metrics [44]. To obtain unbiased estimates of model performance, the test partition was used only once, to evaluate the best model selected among all those trained during the cross-validation phase.
The network inputs were subjected to fold-specific min–max normalisation to the range [0, 1]. The validation and test sets were normalised with the values of the training set of the respective fold. Data were randomly shuffled at every epoch during training.
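The fold-specific normalisation reduces to fitting constants on the training fold only and reusing them for the other splits (plain NumPy; names illustrative):

```python
def minmax_fit(train):
    """Compute normalisation constants on the training fold only."""
    return train.min(), train.max()

def minmax_apply(x, lo, hi):
    """Scale any split with the training fold's constants."""
    return (x - lo) / (hi - lo)

lo, hi = minmax_fit(X[train_idx])
X_train = minmax_apply(X[train_idx], lo, hi)
X_val = minmax_apply(X[val_idx], lo, hi)   # may fall slightly outside [0, 1]
```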
Figures 1 and 2 show the middle axial slice of each PET volume composing the cross-validation dataset, grouped by target class. The test set is not shown, to avoid information leakage during the construction of the models.
Fig. 1 Cross-validation dataset. Middle axial slice of each PET volume. Malignant pulmonary nodules
Fig. 2 Cross-validation dataset. Middle axial slice of each PET volume. Benign pulmonary nodules
Data Augmentation
Offline data augmentation [45] was performed independently in each of the four partitions of the original dataset previously created for cross-validation. Translations, rotations and Gaussian noise injection were applied separately to the original images; the test set was not augmented. The augmentation factor for each type of operation was class-specific in order to balance the classes (Table 1). The augmented dataset comprises original and augmented images, around 4900 in total; its size was determined by the computational resources available for training the models on a larger dataset. During cross-validation, the models were trained on the augmented training set of each fold and evaluated on the corresponding validation set of the original dataset.
Table 1 Data augmentation factor
Translations were random shifts between −10 and 10 pixels along any of the three axes of each original image. A maximum amplitude of 10 pixels (15 mm) was chosen to ensure that the nodules were not moved out of the tensor and that the label was preserved.
Random rotations between −45° and 45° were applied separately around the x-, y- or z-axis of each image, so that each image yields augmented examples with different rotation axes, while each augmented example carries a rotation around a single axis. Since the rotations were around an axis running through the centre of the tensor, each rotation was in fact a composite operation, P_rotated = T⁻¹ · R · T · P, where P is the voxel, T a translation operation and R a rotation operation [46]. After the spatial transformation of the voxel coordinates, intensity interpolation with a bilinear interpolator was applied.
Gaussian noise with zero mean and three different standard deviations (0.1, 0.3 and 0.5) was added to the original images. Gaussian noise was chosen because it models PET image noise [47], so that different augmented images simulate PET images with different noise levels.
The background voxels were filled with zero in all the above operations.
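The three operations can be sketched with SciPy (an assumed implementation written for illustration; the class-specific augmentation factors of Table 1 are not reproduced here):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)  # illustrative seed

def random_translate(vol, max_shift=10):
    """Integer shifts of up to ±10 voxels (±15 mm); background zero-filled."""
    shift = rng.integers(-max_shift, max_shift + 1, size=3)
    return ndimage.shift(vol, shift, order=0, mode='constant', cval=0.0)

def random_rotate(vol):
    """Rotation of ±45° about a single randomly chosen axis through the
    tensor centre, with (bi)linear intensity interpolation."""
    axes = [(0, 1), (0, 2), (1, 2)][rng.integers(3)]
    angle = rng.uniform(-45, 45)
    return ndimage.rotate(vol, angle, axes=axes, reshape=False,
                          order=1, mode='constant', cval=0.0)

def add_gaussian_noise(vol, sigma):
    """Additive zero-mean Gaussian noise; sigma in {0.1, 0.3, 0.5} as above."""
    return vol + rng.normal(0.0, sigma, size=vol.shape)
```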
Training Procedures
The experiment was run in the R language [48]. The R interfaces for TensorFlow (v. 2.2.0) [49] and Keras (v. 2.3.0.0) [50] and the r-reticulate package [51], along with TensorFlow 2.1.0 [52] and Python 3.7.8 [53], were employed. The graphics card was an NVIDIA GeForce MX150.
Models were trained with binary cross-entropy loss [41] and the Adam optimiser [54] throughout the entire experiment. The learning rate was tuned until the optimal value was reached. The learning rates of the different models are shown in Table 2.
Table 2 Models trained by cross-validation
The stopping criterion of the training was the minimum validation loss with a patience of ten epochs, up to a maximum of 100 epochs. The model derived from the training epoch with the lowest validation loss was saved. This procedure was repeated for each fold of the 4-fold cross-validation, resulting in four model versions with different parameter values but identical hyperparameter configurations. Early stopping ensures that the minimisation of the structural risk does not proceed beyond the point of best generalisation, yielding a regularising effect [55].
The original dataset was trained with full-batch learning or with mini-batch learning with a batch size of 16, according to the type of network. Mini-batch learning with a batch size of eight was preferred for the augmented data.
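In Keras terms (shown in Python, which the R interfaces above wrap; the learning rate is a placeholder, and `model` stands for any of the compiled networks of the next section), this training regime corresponds to:

```python
import tensorflow as tf

# `model` is a Keras model, e.g. the Stacked 3D CNN defined below
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # placeholder
              loss='binary_crossentropy')

callbacks = [
    # Stop when the validation loss has not improved for 10 epochs and
    # keep the weights from the epoch with the lowest validation loss
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                     restore_best_weights=True),
]
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=8,     # batch size of eight for augmented data
          shuffle=True, callbacks=callbacks)
```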
Other specifications of the training procedures varied with the network architecture, or even among networks of the same architecture (i.e. they were treated as hyperparameters), and are explained in more detail in the next section.
Network Architecture
Three types of 3D CNN architectures with volumetric inputs were defined. These networks were generalised from homologous 2D CNNs (AlexNet [56], VGGNet [57] and Inception-v2 [58]), and their size was adapted to the complexity of the problem and the size of the dataset. As such, the number and arrangement of layers, the number of filters, the kernel size and other network specifications were treated as hyperparameters, which were tuned until the proposed models were found. These networks were trained using either the original or the augmented datasets. Additionally, a 2D pre-trained model was fine-tuned on the original dataset. Details of the different network architectures are given in Table 2.
Leaky ReLU (with α = 0.3) was the preferred activation function in the 3D CNNs because it allows a small, non-zero gradient when a unit is not active, thus preventing the 'dying ReLU' problem [59, 60].
Weights were randomly initialised according to the scheme proposed by He et al. [61], which was specifically developed for rectifiers.
A network architecture inspired by AlexNet [56] was proposed. Named the Stacked 3D CNN, it is characterised by four 3D convolutional layers and three 3D max-pooling layers, alternately stacked and connected to three fully connected layers (32, 16 and one unit, respectively). The first convolutional layer has eight filters, and the network width increases along the convolutional base by doubling the number of filters at every convolutional layer. The kernel size was 3 × 3 × 3 and the kernel stride was one in the convolutional layers; no padding was applied. Pooling layers consist of max-pooling operations with a kernel size of 2 × 2 × 2, a stride of two and no padding. The first fully connected layer receives the output of the last convolutional layer after it is flattened into a vector (Fig. 3).
Fig. 3 Architecture of the Stacked 3D CNN network (final model)
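This description maps directly onto a Keras definition. The sketch below (Python; the L2 factor is a placeholder, since its tuned value is not given here) follows the stated layer arrangement; with a 40 × 40 × 40 × 1 input and valid padding, the spatial side shrinks 40 → 38 → 19 → 17 → 8 → 6 → 3 → 1 along the stack, so the flattened vector has 64 elements:

```python
from tensorflow.keras import layers, models, regularizers

def conv_block(x, filters, l2):
    """3x3x3 convolution, stride 1, no padding, He init, L2, Leaky ReLU."""
    x = layers.Conv3D(filters, kernel_size=3, strides=1, padding='valid',
                      kernel_initializer='he_normal',
                      kernel_regularizer=regularizers.l2(l2))(x)
    return layers.LeakyReLU(alpha=0.3)(x)

def stacked_3d_cnn(l2=1e-4):  # L2 factor is an illustrative placeholder
    inputs = layers.Input(shape=(40, 40, 40, 1))
    x = conv_block(inputs, 8, l2)                       # 40 -> 38
    x = layers.MaxPooling3D(pool_size=2, strides=2)(x)  # 38 -> 19
    x = conv_block(x, 16, l2)                           # 19 -> 17
    x = layers.MaxPooling3D(pool_size=2, strides=2)(x)  # 17 -> 8
    x = conv_block(x, 32, l2)                           # 8 -> 6
    x = layers.MaxPooling3D(pool_size=2, strides=2)(x)  # 6 -> 3
    x = conv_block(x, 64, l2)                           # 3 -> 1
    x = layers.Flatten()(x)                             # 64-element vector
    for units in (32, 16):
        x = layers.Dense(units, kernel_initializer='he_normal',
                         kernel_regularizer=regularizers.l2(l2))(x)
        x = layers.LeakyReLU(alpha=0.3)(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    return models.Model(inputs, outputs)
```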
The VGG-like network comprises a total of ten layers: multiple stacked convolutional layers, some of them followed by a pooling layer. The output of the last convolutional layer is flattened before a fully connected output layer (Figure A1 of Supplementary Information). The efficient use of 3 × 3 × 3 convolutions is a prominent property of this type of network [57]: convolutions of larger kernel size are factorised into 3 × 3 × 3 convolutions while the receptive field is preserved. Consequently, the depth of the network increases while the number of parameters is reduced. More specifically, 5 × 5 × 5 and 7 × 7 × 7 convolutional layers are replaced by sets of two or three 3 × 3 × 3 convolutional layers. Factorisation of convolutions imposes a greater reduction in the number of parameters in a 3D network than in a 2D one, and therefore a greater regularising effect (Section A.1.1 of Supplementary Information). Both the expansion of the feature maps and the decrease in spatial resolution occur only after each pooling layer. Overlapping max-pooling [56] with a pool size of 3 × 3 × 3, a stride of two and padding was applied.
The Inception-v2-like network is a 3D CNN with three main characteristics (inception modules, 1 × 1 × 1 convolutions and factorisation of convolutions), which introduce sparsity into the network and reduce the number of parameters, making the network more efficient [58]. Inception modules are blocks of several convolutional layers with different kernel sizes, plus a pooling layer, that receive the same input, propagate the information in parallel and concatenate their outputs before passing them to the next layer [58]. Much of the computational efficiency comes from 1 × 1 × 1 convolutions that reduce the number of feature maps before expensive convolutions of larger kernel size. Factorising convolutions of larger kernel size into stacks of 3 × 3 × 3 or asymmetric convolutions, while preserving the receptive field, provides a further gain in efficiency. There are two types of inception modules: a standard module for learning representations and another that simultaneously downsizes the feature maps [58]. Two versions of each were implemented in the network, which has four standard inception modules and two reduction modules. The output of the last convolutional layer is converted into a vector by global average pooling, which feeds a fully connected output layer (Figures A2 to A6 of Supplementary Information).
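For illustration, a generic 3D inception module with 1 × 1 × 1 reductions and a factorised 5 × 5 × 5 branch might look as follows (Keras/Python; the filter counts are hypothetical placeholders, and plain ReLU is used for brevity where the study used Leaky ReLU; the study's exact module widths are in its Supplementary Information):

```python
from tensorflow.keras import layers

def inception_module_3d(x, f1, f3r, f3, f5r, f5, fp):
    """Generic 3D inception module; filter counts f* are placeholders."""
    # Branch 1: 1x1x1 convolution
    b1 = layers.Conv3D(f1, 1, padding='same', activation='relu')(x)
    # Branch 2: 1x1x1 reduction followed by a 3x3x3 convolution
    b3 = layers.Conv3D(f3r, 1, padding='same', activation='relu')(x)
    b3 = layers.Conv3D(f3, 3, padding='same', activation='relu')(b3)
    # Branch 3: 1x1x1 reduction, then a 5x5x5 convolution factorised
    # into two stacked 3x3x3 convolutions (same receptive field)
    b5 = layers.Conv3D(f5r, 1, padding='same', activation='relu')(x)
    b5 = layers.Conv3D(f5, 3, padding='same', activation='relu')(b5)
    b5 = layers.Conv3D(f5, 3, padding='same', activation='relu')(b5)
    # Branch 4: max-pooling followed by a 1x1x1 projection
    bp = layers.MaxPooling3D(pool_size=3, strides=1, padding='same')(x)
    bp = layers.Conv3D(fp, 1, padding='same', activation='relu')(bp)
    # Concatenate the parallel branches along the channel axis
    return layers.Concatenate(axis=-1)([b1, b3, b5, bp])
```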
Transfer learning [62] from the ImageNet dataset [63] was also performed. A pre-trained ResNet-50 [64] was used as a feature extractor. Two fully connected layers were added on top and initialised according to He et al. [61]. Because the dataset of the current problem is quite different from that of the source domain, only the earlier layers of ResNet-50 were used (up to conv3_block1_out). Additionally, a few of the top layers (from conv3_block1_1_conv onwards) were fine-tuned with a very low learning rate. Because of the 2D architecture, the input for this network consists of the three central slices (19, 20 and 21) of the PET volume, orthogonal to the third axis, each stored in a different channel.
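A sketch of this truncated, partially fine-tuned backbone (Keras/Python; the width of the added fully connected layer is a placeholder, while conv3_block1_out and conv3_block1_1_conv are the standard Keras ResNet-50 layer names cited above):

```python
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False,
                                      input_shape=(40, 40, 3))
# Truncate the backbone at conv3_block1_out (earlier layers only)
backbone = tf.keras.Model(base.input, base.get_layer('conv3_block1_out').output)

# Freeze everything below conv3_block1_1_conv; fine-tune from there upwards
trainable = False
for layer in backbone.layers:
    if layer.name == 'conv3_block1_1_conv':
        trainable = True
    layer.trainable = trainable

x = layers.Flatten()(backbone.output)
x = layers.Dense(32, kernel_initializer='he_normal')(x)  # width is a placeholder
output = layers.Dense(1, activation='sigmoid')(x)
model_2d = tf.keras.Model(backbone.input, output)
```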
Besides the regularisation procedures already described, L2 regularisation [42] was applied to all layers with parameters of the 3D CNN models, whereas dropout [65] was applied to the fully connected layers of the 2D CNN model.
Performance Metrics and Model Selection
The performance metric selected to evaluate the models was the area under the receiver operating characteristic (ROC) curve, computed with the trapezoidal rule from non-parametric ROC curves [66]. During cross-validation, model evaluation was conducted on the validation partition of the respective fold. Different models were compared by their mean area under the ROC curve over the four folds.
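scikit-learn's roc_auc_score computes exactly this trapezoidal, non-parametric estimate (an illustrative usage; the variable names are assumptions):

```python
from sklearn.metrics import roc_auc_score

# One AUC per cross-validation fold, then the mean across the four folds
fold_aucs = [roc_auc_score(y_val_folds[k], p_val_folds[k]) for k in range(4)]
mean_auc = sum(fold_aucs) / len(fold_aucs)
```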
Models with different network architectures were trained. To deal with a source of non-determinism in TensorFlow on the GPU (see footnote 1), the best model of each network architecture was retrained and re-evaluated over 10 iterations, under identical conditions, on the 4-fold cross-validation. The performance metrics averaged over the 10 iterations were compared across the different models, and the best model was selected. Since that model has 10 versions for each fold, one of them was randomly picked.
Subsequently, an ensemble classifier was built from the four versions of the best model derived from the 4-fold cross-validation by averaging their output probabilities, weighted by the size of each training partition. This ensemble classifier was evaluated on the test set to determine its generalisation performance on unseen examples. The 95% confidence interval of the area under the ROC curve on the test set was determined according to the method described by DeLong [67].
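The weighted averaging itself reduces to a single NumPy call (names illustrative):

```python
import numpy as np

# probs: (4, n_test) array of output probabilities of the four model versions;
# train_sizes: number of training examples behind each version
ensemble_probs = np.average(probs, axis=0, weights=train_sizes)
```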
Accuracy, sensitivity and specificity were complementary metrics determined on the test set. Instead of using the standard decision threshold of 0.5, an optimal decision threshold was determined for each version of the best model on the respective validation partition. The four decision thresholds were averaged, and the resulting threshold was applied to convert the output probabilities of the ensemble model into classes on the test set. If the predicted probability was equal to or higher than the threshold, the nodule was classified as malignant; otherwise, it was classified as benign. The optimal threshold was determined according to two different approaches. In the first, the optimal threshold was the posterior probability that maximises the Youden index [68]. In the second scenario, the cost of a false negative was considered higher than that of a false positive; therefore, a minimum sensitivity of 95% was set and the cut-off point that maximises the specificity was sought.
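Both threshold rules can be read off the validation ROC curve (scikit-learn sketch; names illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thr = roc_curve(y_val, p_val)

# Approach 1: maximise the Youden index J = sensitivity + specificity - 1
youden_thr = thr[np.argmax(tpr - fpr)]

# Approach 2: among cut-offs with sensitivity >= 95%, maximise specificity
ok = tpr >= 0.95
minsens_thr = thr[ok][np.argmax(1 - fpr[ok])]
```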
Paired Comparison Between the Final CNN Model and the SUVmax
As a secondary analysis, a hypothesis test was performed on the test set to infer a possible difference in the area under the ROC curve between the final CNN model and the SUVmax in the population (H1: AUC-ROC_CNN ≠ AUC-ROC_SUVmax), starting from the null hypothesis of equality. The type I error (α) was predefined as 0.05.
The non-parametric test developed by DeLong et al. [67], which makes a paired comparison of the areas under ROC curves, is applied if one ROC curve is uniformly higher than the other across all operating points, that is, if the curves do not cross; otherwise, the hypothesis test based on ROC shape proposed by Venkatraman and Begg [69] is applied.
Model Explainability
Grad-CAM analysis [70] was applied to generate visual explanations for the decisions of the model. This method highlights the most class-discriminant regions of a volume of interest from the standpoint of the 3D CNN classification model, providing insights into how the model succeeded or failed. A Grad-CAM 3D heatmap was obtained for each PET volume of the test set from each of the four 3D CNN model versions composing the ensemble model. Fusion images were created for selected cases by superimposing the axial slices of the Grad-CAM 3D heatmap on the axial slices of the original input PET volume. The volumes were reprocessed to obtain ten axial slices rather than forty to facilitate the representation of the images. Red and dark red tones represent higher Grad-CAM scores for a class and thus mark the regions of the input volume most relevant for the model decision.
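A minimal Grad-CAM sketch for a 3D CNN (Python/TensorFlow; `conv_layer_name` is the name of the model's last convolutional layer, an assumption to be read off the actual model with model.summary()):

```python
import numpy as np
import tensorflow as tf

def grad_cam_3d(model, volume, conv_layer_name):
    """Grad-CAM heatmap for one PET cube (shape (40, 40, 40))."""
    # Model mapping the input to (last conv feature maps, prediction)
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output])
    x = volume[np.newaxis, ..., np.newaxis].astype('float32')  # (1,40,40,40,1)
    with tf.GradientTape() as tape:
        fmaps, preds = grad_model(x)
        score = preds[:, 0]                      # sigmoid 'malignant' score
    grads = tape.gradient(score, fmaps)
    weights = tf.reduce_mean(grads, axis=(1, 2, 3))   # GAP of the gradients
    cam = tf.nn.relu(tf.einsum('bxyzc,bc->bxyz', fmaps, weights))[0]
    cam = cam / (tf.reduce_max(cam) + 1e-8)           # normalise to [0, 1]
    return cam.numpy()  # upsample to input resolution before overlaying
```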