Our method was validated on two databases, 3D-IRCADb and LiTS 2017. The LiTS dataset provides 130 scans with liver segmentation labels, and the 3D-IRCADb dataset provides 20 scans. In total, 110 scans were used for training and 40 for testing, with the training and testing data kept separate. Tumor and liver labels were merged into a single whole-liver label. The data were collected from different hospitals, and the resolution of the CT scans varies between 0.45 mm and 6 mm for intra-slice and between 0.6 and 1.0 mm for inter-slices (512 × 512 pixels), respectively.2 Unless otherwise specified, the following parameters are fixed in this paper: . The computation was done on a Windows 10 server with an Intel Xeon Silver 4210R CPU (2.4 GHz), 64 GB of memory, and an Nvidia GeForce Titan RTX GPU.
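The label-merging step can be sketched as follows. This is a minimal illustration, assuming the LiTS-style encoding in which 0 is background, 1 is liver, and 2 is tumor; the function name `merge_to_whole_liver` and the nested-list volume layout are hypothetical, and an array library would normally be used on real CT volumes.

```python
def merge_to_whole_liver(label_volume):
    """Map a labeled volume to a binary whole-liver mask.

    Assumes LiTS-style labels: 0 = background, 1 = liver, 2 = tumor.
    The volume is a nested list indexed as [slice][row][column].
    """
    return [
        [[1 if voxel > 0 else 0 for voxel in row] for row in slice_]
        for slice_ in label_volume
    ]


# Tiny hypothetical volume: one slice with two rows.
labels = [[[0, 1, 2], [2, 0, 1]]]
whole_liver = merge_to_whole_liver(labels)
print(whole_liver)
```

Every non-background voxel becomes 1, so tumor voxels inside the liver count as liver in the ground truth.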
3.1 Effectiveness of the proposed method
Figure 4 shows the liver segmentation results of the proposed method for three test images. Figure 4a,b shows the segmentation results obtained by our method, and Figure 4c,d shows the corresponding manual segmentations. The results of our method are quite similar to the manual segmentations. Figure 5 shows the coronal view of the liver segmentation results for one test image using our method. The green lines and the red lines are the manual segmentation and the proposed method's segmentation, respectively. The proposed method's segmentation is very close to the manual segmentation.
FIGURE 4. 3D view of the segmentation results for liver labels of three test images using our method. (a and b) The segmentation results by our method. (c and d) The corresponding manual segmentation
FIGURE 5. Coronal view of the segmentation results of liver labels by our method
We compared the performance of CNN + DRLSEIC with that of the CNN alone on the same training and testing sets. An example of the segmented liver in one subject is illustrated in Figure 6. The CNN model (Figure 6a, red line) produces poor segmentations in certain areas, mainly because of the low contrast between those areas and the other segmented regions. The result of CNN + DRLSEIC (Figure 6b, red line) mostly overlaps with the ground-truth segmentation (green line) and shows fewer false-positive labels.
FIGURE 6. Example of a liver segmentation using our method. (a) Results of the convolutional neural network (CNN); (b) results of CNN + DRLSEIC
3.2 Quantitative evaluation of the segmentation accuracy
Five spatial metrics were adopted to evaluate the agreement between automatic and manual segmentation,33 namely the Dice coefficient (DC), true positive rate (TPR), volume difference (VD), Jaccard index (JI), and positive predictive value (PPV). These metrics are defined in Equations (18)–(22), respectively, where S is the segmentation result, G is the ground truth, and the complement operator is applied to these sets. The border voxels of the segmentation and of the ground truth form two surface sets. For each voxel p along a given border, the surface distance is measured to the closest voxel along the corresponding border in the other result.
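For reference, the five overlap metrics are commonly written in the set-based forms below. This is a sketch under stated assumptions rather than a restatement of Equations (18)–(22): the normalization of VD by |G|, the border symbols \partial S and \partial G, and the distance notation d(p, ·) are assumed here, while S, G, and the complement (written with an overbar) follow the definitions in the text.

```latex
\begin{align*}
\mathrm{DC}  &= \frac{2\,|S \cap G|}{|S| + |G|}, \\
\mathrm{TPR} &= \frac{|S \cap G|}{|S \cap G| + |\bar{S} \cap G|} = \frac{|S \cap G|}{|G|}, \\
\mathrm{VD}  &= \frac{\bigl|\,|S| - |G|\,\bigr|}{|G|}, \\
\mathrm{JI}  &= \frac{|S \cap G|}{|S \cup G|}, \\
\mathrm{PPV} &= \frac{|S \cap G|}{|S \cap G| + |S \cap \bar{G}|} = \frac{|S \cap G|}{|S|}.
\end{align*}
% Closest-border distance for a border voxel p (assumed notation):
\[
d(p, \partial G) = \min_{q \in \partial G} \lVert p - q \rVert_2,
\qquad
d(p, \partial S) = \min_{q \in \partial S} \lVert p - q \rVert_2 .
\]
```

In these forms DC, TPR, JI, and PPV lie in [0, 1] with 1 indicating perfect overlap, whereas VD is 0 when the two volumes are equal in size.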
The mean surface distance (MSD) is defined in Equation (23), where N1 and N2 are the numbers of voxels on the border surfaces of the segmentation and the ground truth. The Hausdorff surface distance (HSD), which is closely related to the MSD, is defined in Equation (24).
The performance of our method was compared against five state-of-the-art methods: the Chan-Vese (CV) model,34 the geodesic active contours (GAC) model,35 the DRLSE model,36 the selective binary and Gaussian filtering regularized level set (SBGFRLS) model,37 and the local binary fitting (LBF) model.38 Figure 7 shows the Dice, JI, PPV, and TPR obtained by each method. The median Dice score reaches 0.961 for the proposed method, compared with 0.912 for DRLSE, 0.763 for the CV model, 0.772 for the image visual control (IVC) model, 0.744 for the LBF model, and 0.752 for the GAC model. The median JI score reaches 0.941 for the proposed method, compared with 0.884 for DRLSE, 0.733 for the CV model, 0.779 for the IVC model, 0.714 for the LBF model, and 0.682 for the GAC model. The median PPV score reaches 0.948 for the proposed method, compared with 0.894 for DRLSE, 0.748 for the CV model, 0.77 for the IVC model, 0.742 for the LBF model, and 0.751 for the GAC model. The median TPR score reaches 0.978 for the proposed method, compared with 0.891 for DRLSE, 0.879 for the CV model, 0.883 for the IVC model, 0.914 for the LBF model, and 0.878 for the GAC model. All five compared methods included non-liver regions during level set evolution, whereas the proposed method constrains the level set contour to evolve inside the liver region. Therefore, the proposed method outperformed the other methods on all four metrics.
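The overlap and surface-distance metrics used in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: masks are represented as sets of (z, y, x) voxel coordinates, border voxels are taken to be those with a 6-connected neighbor outside the mask, VD is normalized by the ground-truth volume, and all function names are hypothetical.

```python
import math

NEIGHBOR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def overlap_metrics(seg, gt):
    """Overlap metrics for two non-empty sets of voxel coordinates."""
    inter = len(seg & gt)
    return {
        "DC": 2 * inter / (len(seg) + len(gt)),
        "TPR": inter / len(gt),
        "VD": abs(len(seg) - len(gt)) / len(gt),  # assumed normalization by |G|
        "JI": inter / len(seg | gt),
        "PPV": inter / len(seg),
    }


def border(voxels):
    """Voxels that have at least one 6-connected neighbor outside the set."""
    return {
        (z, y, x)
        for (z, y, x) in voxels
        if any((z + dz, y + dy, x + dx) not in voxels for dz, dy, dx in NEIGHBOR_OFFSETS)
    }


def surface_distances(seg, gt, spacing=(1.0, 1.0, 1.0)):
    """Return (MSD, HSD) between the borders of two masks, in spacing units."""
    border_seg, border_gt = border(seg), border(gt)

    def dist(p, q):
        return math.sqrt(sum(((a - b) * s) ** 2 for a, b, s in zip(p, q, spacing)))

    # Distance from each border voxel to the closest border voxel of the other mask.
    seg_to_gt = [min(dist(p, q) for q in border_gt) for p in border_seg]
    gt_to_seg = [min(dist(p, q) for q in border_seg) for p in border_gt]
    msd = (sum(seg_to_gt) + sum(gt_to_seg)) / (len(border_seg) + len(border_gt))
    hsd = max(max(seg_to_gt), max(gt_to_seg))
    return msd, hsd


# Hypothetical example: a 4x4x4 cube and the same cube shifted by one voxel in x.
seg = {(z, y, x) for z in range(4) for y in range(4) for x in range(4)}
gt = {(z, y, x + 1) for (z, y, x) in seg}
metrics = overlap_metrics(seg, gt)
msd, hsd = surface_distances(seg, gt)
print(metrics, msd, hsd)
```

The brute-force nearest-border search is quadratic in the number of border voxels, which is acceptable for a small example; a distance transform would be the usual choice for full-size CT volumes.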
FIGURE 7. Quantitative comparison of the proposed method with CV, LBF, distance regularized level set evolution (DRLSE), IVC, and GAC
The VD values of the liver segmentation are presented in Table 1. The proposed method obtained a very low VD value for most of the cases. However, case 05 and case 27 received unsatisfactory results (VD of 0.112 and 0.194, respectively), mainly because more voxels were misclassified in these cases, which increased the VD.
TABLE 1. Volume difference between the proposed method and the manual segmentation for each test case

| Dataset | VD (%) | Dataset | VD (%) | Dataset | VD (%) | Dataset | VD (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Case 01 | 0.055 | Case 11 | 0.04 | Case 21 | 0.028 | Case 31 | 0.047 |
| Case 02 | 0.027 | Case 12 | 0.037 | Case 22 | 0.068 | Case 32 | 0.015 |
| Case 03 | 0.011 | Case 13 | 0.006 | Case 23 | 0.046 | Case 33 | 0.022 |
| Case 04 | 0.083 | Case 14 | 0.048 | Case 24 | 0.041 | Case 34 | 0.023 |
| Case 05 | 0.112 | Case 15 | 0.137 | Case 25 | 0.081 | Case 35 | 0.061 |
| Case 06 | 0.072 | Case 16 | 0.022 | Case 26 | 0.077 | Case 36 | 0.039 |
| Case 07 | 0.045 | Case 17 | 0.013 | Case 27 | 0.194 | Case 37 | 0.019 |
| Case 08 | 0.077 | Case 18 | 0.017 | Case 28 | 0.052 | Case 38 | 0.017 |
| Case 09 | 0.092 | Case 19 | 0.058 | Case 29 | 0.036 | Case 39 | 0.051 |
| Case 10 | 0.053 | Case 20 | 0.034 | Case 30 | 0.044 | Case 40 | 0.083 |

Abbreviation: VD, volume difference.

The numbers of convolutional layers and up-sampling layers have a great impact on the segmentation accuracy of a CNN. To select an optimal structure, four configurations with different numbers of convolutional and up-sampling layers were evaluated. The resulting evaluation metrics are summarized in Table 2. The structure with five convolutional layers and five up-sampling layers achieves the best performance. For the input image size used here, it is difficult to extract features from the feature map once six max-pooling operations have been applied. Therefore, the six-layer structure performed markedly worse than the five-layer structure.
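The effect of repeated max pooling on the feature-map size can be illustrated with a short calculation. This is a sketch under assumptions that the text does not state: a 512 × 512 network input (taken from the CT slice size given in the data description), 2 × 2 max pooling with stride 2, and one pooling operation per convolutional stage. The function name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(input_size, num_poolings, pool=2):
    """Side length of a square feature map after each pooling operation."""
    sizes = [input_size]
    for _ in range(num_poolings):
        # Each pooling with stride `pool` divides the side length by `pool`.
        sizes.append(sizes[-1] // pool)
    return sizes


# Assumed 512 x 512 input; compare the four structures from Table 2.
for depth in (3, 4, 5, 6):
    final = feature_map_sizes(512, depth)[-1]
    print(f"{depth} poolings: final feature map {final} x {final}")
```

Under these assumptions each extra pooling halves the side length of the deepest feature map, so the six-layer structure has the least spatial detail left to extract features from.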
TABLE 2. Accuracy for different numbers of convolutional layers and up-sampling layers

| Metrics | 3 conv & 3 up-sampling | 4 conv & 4 up-sampling | 5 conv & 5 up-sampling | 6 conv & 6 up-sampling |
| --- | --- | --- | --- | --- |
| Dice (%) | 0.90 ± 0.03 | 0.91 ± 0.02 | 0.958 ± 0.021 | 0.84 ± 0.05 |
| TPR (%) | 0.87 ± 0.03 | 0.835 ± 0.04 | 0.971 ± 0.022 | 0.911 ± 0.042 |
| VD (%) | 0.15 ± 0.03 | 0.15 ± 0.05 | 0.05 ± 0.034 | 0.35 ± 0.06 |
| JI (%) | 0.82 ± 0.02 | 0.835 ± 0.02 | 0.921 ± 0.021 | 0.721 ± 0.061 |
| PPV (%) | 0.961 ± 0.03 | 0.955 ± 0.04 | 0.952 ± 0.031 | 0.912 ± 0.021 |
| MSD (mm) | 15.33 ± 4.13 | 11.91 ± 2.27 | 9.58 ± 2.97 | 12.77 ± 3.35 |
| HSD (mm) | 5.74 ± 0.92 | 4.94 ± 1.32 | 3.44 ± 1.09 | 5.04 ± 1.03 |

Abbreviations: HSD, Hausdorff surface distance; JI, Jaccard index; MSD, mean surface distance; PPV, positive predictive value; TPR, true positive rate; VD, volume difference.

As Table 2 shows, the network with five convolutional layers and five up-sampling layers gave the most robust performance, achieving a mean Dice of 0.958, a mean TPR of 0.971, a mean VD of 0.05, a mean JI of 0.921, and a mean PPV of 0.952. Based on this experiment, a network of five convolutional layers and five up-sampling layers was established as the optimal structure of the proposed CNN.
Table 3 shows the influence of the level set model on segmentation accuracy by comparing the results obtained with and without it. The level set model increases the segmentation accuracy by roughly 1–2 percentage points (for example, the mean Dice rises from 0.941 to 0.952). The reason is that the proposed level set model detects clearer boundaries and thus improves the segmentation results.
TABLE 3. Comparison of our model with and without the level set evolution

| Metrics | CNN | CNN + DRLSEIC |
| --- | --- | --- |
| Dice (%) | 0.941 ± 0.014 | 0.952 ± 0.017 |
| TPR (%) | 0.933 ± 0.021 | 0.944 ± 0.015 |
| VD (%) | 0.14 ± 0.03 | 0.09 ± 0.015 |
| JI (%) | 0.872 ± 0.011 | 0.891 ± 0.021 |
| PPV (%) | 0.914 ± 0.015 | 0.942 ± 0.019 |
| MSD (mm) | 11.12 ± 3.04 | 9.52 ± 2.74 |
| HSD (mm) | 4.28 ± 1.02 | 3.28 ± 0.92 |

Abbreviations: CNN, convolutional neural network; HSD, Hausdorff surface distance; JI, Jaccard index; MSD, mean surface distance; PPV, positive predictive value; TPR, true positive rate; VD, volume difference.

We compared our method with four other CNN models. Table 4 shows the results for U-net, U-net++, Segnet, fully convolutional networks (FCN), and the proposed method. For a fair comparison, we used five convolution layers for each model, with a kernel size of 3. The proposed network offers the most accurate segmentation results in terms of Dice, TPR, JI, MSD, and HSD. Its VD equals that of U-net++ (0.07), whereas U-net and U-net++ achieve a higher PPV (0.961 and 0.955 vs. 0.931).
TABLE 4. Comparison of different CNN segmentation methods

| Metrics | U-net | U-net++ | Segnet | FCN | Proposed |
| --- | --- | --- | --- | --- | --- |
| Dice (%) | 0.91 ± 0.03 | 0.931 ± 0.03 | 0.901 ± 0.02 | 0.82 ± 0.05 | 0.958 ± 0.02 |
| TPR (%) | 0.88 ± 0.03 | 0.941 ± 0.03 | 0.931 ± 0.02 | 0.891 ± 0.03 | 0.951 ± 0.02 |
| VD (%) | 0.12 ± 0.03 | 0.07 ± 0.04 | 0.15 ± 0.04 | 0.38 ± 0.05 | 0.07 ± 0.02 |
| JI (%) | 0.85 ± 0.02 | 0.875 ± 0.03 | 0.781 ± 0.02 | 0.691 ± 0.03 | 0.901 ± 0.03 |
| PPV (%) | 0.961 ± 0.03 | 0.955 ± 0.04 | 0.912 ± 0.02 | 0.902 ± 0.04 | 0.931 ± 0.02 |
| MSD (mm) | 12.33 ± 2.83 | 10.08 ± 3.02 | 13.48 ± 3.56 | 15.77 ± 4.65 | 9.27 ± 3.38 |
| HSD (mm) | 4.48 ± 1.12 | 3.94 ± 1.02 | 4.74 ± 1.19 | 5.04 ± 1.03 | 3.13 ± 0.98 |

Abbreviations: CNN, convolutional neural network; HSD, Hausdorff surface distance; JI, Jaccard index; MSD, mean surface distance; PPV, positive predictive value; TPR, true positive rate; VD, volume difference.

In our paired t-tests, the significance level was set to 0.05. The p-values are summarized in Table 5. They show that the difference between our proposed method and each of the other four methods is significant.
TABLE 5. p-values of paired t-tests between our model and the other four methods for Dice values

| Metrics | Dice |
| --- | --- |
| U-net vs. Ours | |
| U-net++ vs. Ours | |
| Segnet vs. Ours | |
| FCN vs. Ours | |
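A paired t-test of this kind can be sketched as follows. The per-case Dice values below are hypothetical placeholders, not results from this study, and the function name `paired_t_statistic` is illustrative. The sketch computes only the t statistic and the degrees of freedom; the p-value would then be obtained from a Student t distribution with n − 1 degrees of freedom (for example with `scipy.stats.ttest_rel`, if SciPy is available).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on matched samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Sample standard deviation of the differences (n - 1 in the denominator).
    sd_diff = statistics.stdev(diffs)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-case Dice values for five test cases (illustration only).
dice_ours = [0.96, 0.95, 0.97, 0.94, 0.96]
dice_other = [0.91, 0.92, 0.93, 0.90, 0.92]

t_stat, dof = paired_t_statistic(dice_ours, dice_other)
# Two-sided critical value of Student's t for alpha = 0.05 and 4 degrees of freedom.
T_CRITICAL_DF4 = 2.776
print(f"t = {t_stat:.3f}, dof = {dof}, significant = {abs(t_stat) > T_CRITICAL_DF4}")
```

The test is paired because both methods are evaluated on the same test cases, so the per-case differences, rather than the two sets of scores, are what is tested against zero.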