A novel artificial intelligence-based endoscopic ultrasonography diagnostic system for diagnosing the invasion depth of early gastric cancer

Study design and patients

We identified consecutive cases of EGC in which EUS was performed using a miniature probe at Osaka University between June 2009 and December 2019 to create a dataset for developing and validating the AI system. The exclusion criteria were as follows: (1) no endoscopic or surgical resection performed; (2) absence of evaluable images; (3) images from second or subsequent EUS examinations of the same lesion; (4) no evidence of cancer in the resected specimen; and (5) difficulty determining corresponding lesions in cases of multiple lesions.

As an external validation cohort, we used EUS images from EGC patients prospectively enrolled between May 2017 and January 2021 at 11 institutions in our previous study (UMIN000025862) [6]. In that study, EGC patients with suspected SM invasion on screening endoscopy were enrolled, and the exclusion criteria were as follows: (1) previous gastrectomy or esophagectomy, (2) suspected local recurrence, (3) suspected special histological type of EGC, such as neuroendocrine carcinoma, GC with lymphoid stroma, or GC of fundic gland type, (4) treatment not expected within 8 weeks of diagnosis, and (5) serious comorbidities or multiple active cancers that made EGC treatment impractical. Among the enrolled patients, those who met any of the following criteria were excluded from the present study: (1) examination performed at Osaka University; (2) no endoscopic or surgical resection performed; and (3) inability to collect EUS images. We excluded cases from Osaka University because some of them were included in the development and internal validation datasets. EUS images of all eligible cases were retrospectively collected and used for external validation.

This study was approved by the ethics committee of Osaka University (No. 20324 and No. 22028) and performed in accordance with the Declaration of Helsinki. The requirement for informed consent was waived, and all participants were given the opportunity to refuse participation via an opt-out method on each institution's website.

EUS procedure and diagnosis

Following the diagnostic procedure by CE, EUS was performed using miniature probes with a frequency of 12 or 20 MHz (UM-2R, 12 MHz; UM-3R, 20 MHz; UM-DP20-25R, 20 MHz: Olympus Corporation; P-2226-12, 12 MHz; P-2226-20, 20 MHz: Fujifilm Corporation) and an ultrasound system (EU-M2000, EU-ME1, or EU-ME2: Olympus Corporation; SP-702 or SP-900: Fujifilm Corporation). In principle, examinations were performed with a 20 MHz probe; a 12 MHz probe was used only when detailed observation was difficult. Lesions in which the third of the five sonographic layers showed invagination, thinning, or complete destruction were diagnosed as SM2 (≥500 μm SM invasion from the muscularis mucosae) or deeper. All other lesions were diagnosed as M-SM1 (SM1: <500 μm SM invasion from the muscularis mucosae) because M and SM1 are difficult to differentiate with EUS. As a result, all lesions were classified as either “M-SM1” or “SM2 or deeper.”

Construction of the dataset

The images collected at Osaka University were divided by period and used as the development and internal validation datasets. We excluded images that depicted lesions other than the target lesion, noisy or blurred images, and images with annotations such as arrows and text. We used all remaining images, including images that appeared to have captured normal mucosa around the target lesion and low-quality images that were inappropriate for diagnosis.

Subsequently, all EUS images in the development dataset were scored by an expert gastroenterologist based on the histological invasion depth. Because EUS images vary substantially in both the degree of suspected invasion and their suitability for diagnosis, it is not feasible to assess them with a simple binary label for the presence or absence of invasion. Therefore, we used a three-vector scoring system: a quality score (the quality of visualization, such as layer separation), a noninvasion score (the possibility of no SM invasion), and an invasion score (the possibility of SM invasion) (Fig. 1a, b). Specifically, quality was scored as 0 (favorable), 1 (intermediate), or 2 (poor) based on the quality of layer separation (Fig. 1a). The possibility of SM invasion was evaluated based on the degree of destruction of the submucosal layer as follows: no destruction of the submucosal layer and no suspicion of invasion, M-SM1 (noninvasion score: 2, invasion score: 0); slight destruction with possible invasion, M-SM1 > SM2 or deeper (noninvasion score: 1, invasion score: 0); moderate destruction with suspected invasion, M-SM1 < SM2 or deeper (noninvasion score: 0, invasion score: 1); and severe destruction with obvious invasion, SM2 or deeper (noninvasion score: 0, invasion score: 2) (Fig. 1b). In images with poor layer separation (quality score: 2), invasion could not be evaluated, so both scores were set to 0 (noninvasion score: 0, invasion score: 0). All combinations of scores used in this study are shown in Fig. 1c and enumerated in the sketch below. For some of the images in the development dataset, we manually segmented the tumor, submucosal layer, and muscular layer to train the segmentation model. For the internal and external validation datasets, we labeled only the lesion-level invasion depth, without image-level labeling or segmentation.
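The scoring rules above admit nine valid tag combinations, as shown in Fig. 1c. For illustration, they can be enumerated as follows (a minimal sketch, not the study's code; the variable names are ours):

```python
# Illustrative encoding of the label combinations in Fig. 1c: quality 0-2,
# plus (noninvasion, invasion) pairs determined by the degree of submucosal
# destruction; poor quality (2) forces both invasion-related scores to 0.
from itertools import product

INVASION_PAIRS = [
    (2, 0),  # no destruction, no suspicion of invasion (M-SM1)
    (1, 0),  # slight destruction, invasion possible (M-SM1 > SM2 or deeper)
    (0, 1),  # moderate destruction, invasion suspected (M-SM1 < SM2 or deeper)
    (0, 2),  # severe destruction, obvious invasion (SM2 or deeper)
]

VALID_LABELS = [
    (quality, noninv, inv)
    for quality, (noninv, inv) in product((0, 1), INVASION_PAIRS)
] + [(2, 0, 0)]  # poor layer separation: invasion not evaluable

assert len(VALID_LABELS) == 9  # the nine combinations of Fig. 1c
```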

Fig. 1

An overview of the labeling of the dataset and the AI model used in this study. a Scoring of layer separation (quality score), which was classified into three categories: favorable (0), intermediate (1), and poor (2). b Scoring of submucosal invasion (noninvasion and invasion scores), which was categorized into four groups based on the degree of destruction of the submucosal layer. If the quality score was 2, it was difficult to evaluate invasion, and both the noninvasion and invasion scores were set to 0. c All combinations of the scores used in this study. All EUS images in the development dataset were labeled with one of these tags. d Overview of the AI system developed in this study. The EUS images were first input into the segmentation model (1st step), which segmented the tumor, submucosal layer, and muscular layer. The output images were merged with the original images and then input into the classification model (2nd step), which provided the quality score, noninvasion score, and invasion score for each image. The scores were output for all images of each lesion, and the highest invasion score determined whether the lesion was classified as “M-SM1” or “SM2 or deeper”. M-SM1, mucosal cancer or cancer in the submucosa <500 μm from the muscularis mucosae; SM2, cancer in the submucosa ≥500 μm from the muscularis mucosae; EUS, endoscopic ultrasonography; EGC, early gastric cancer; CNN, convolutional neural network

Development of the AI system

We utilized PyTorch (https://pytorch.org/), a deep learning framework, to develop the AI system. We constructed the AI system as a two-step diagnostic system using convolutional neural networks (Fig. 1d). The first step consisted of a segmentation model that mapped the tumor, submucosal layer, and muscular layer in EUS images. The segmentation model used a U-Net architecture with a ResNet34 backbone. The input image was resized to a square of 512 × 512 pixels, and the model was trained to maximize the Dice coefficient using the Adam optimizer. To prevent overfitting, we trained the model with data augmentation techniques such as HorizontalFlip, ShiftScaleRotate, and RandomBrightnessContrast. The map images output by the U-Net were blended with the original EUS images at a ratio of 1.0:0.2 and used as input for the following step. The parameters of the training procedure are given in Supplementary Table 1. The output images from the first step were resized to 224 × 224 pixels and then input into the second step.
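The paper names the architecture, loss, optimizer, and augmentations but does not publish code. Below is a minimal sketch of how the first step could be assembled, assuming the segmentation_models_pytorch and albumentations libraries (the augmentation names in the text match albumentations transforms); the learning rate, blending formula, and weight ordering are illustrative guesses:

```python
# Sketch of the first-step segmentation model (library choices are assumptions).
import torch
import albumentations as A
import segmentation_models_pytorch as smp
from segmentation_models_pytorch.losses import DiceLoss

# U-Net with a ResNet34 backbone; three output channels for the tumor,
# submucosal layer, and muscular layer masks.
model = smp.Unet(encoder_name="resnet34", encoder_weights="imagenet",
                 in_channels=3, classes=3)

# Maximizing the Dice coefficient is equivalent to minimizing the Dice loss.
criterion = DiceLoss(mode="multilabel")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is illustrative

# Augmentations named in the text, plus resizing to 512 x 512 pixels.
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
    A.Resize(512, 512),
])

def blend(seg_map: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
    """Mix the segmentation map with the original image at a 1.0:0.2 ratio
    before the second step; the exact formula and weight ordering are our
    reading of the text, not confirmed by the source."""
    return 1.0 * seg_map + 0.2 * original
```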

The second step consisted of a classification model that simultaneously output the quality score (0–2), noninvasion score (0–2), and invasion score (0–2). The classification model used a pretrained EfficientNetV2-L network in which the original fully connected layer was removed and replaced with a new fully connected head containing a hidden layer of 128 nodes. For parameter tuning, we split the development dataset into 5 groups and performed fivefold cross-validation. All original layers of EfficientNetV2-L and the new fully connected layer were trained. We trained the model to maximize the mean AUC of the three scores using the rectified Adam (RAdam) optimizer with root mean square error as the loss function. To prevent overfitting, we trained the model with data augmentation techniques such as HorizontalFlip, ShiftScaleRotate, and RandomBrightnessContrast. The parameters of the training procedure are given in Supplementary Table 2. Finally, we used an ensemble of the 5 models obtained from the fivefold cross-validation as our AI model, with averaging as the ensemble technique. Exploratory analysis of the development dataset showed that the maximum invasion score for each lesion was particularly predictive of the depth of invasion (see Supplementary Method; Supplementary Fig. 1). Therefore, only the invasion score was used for the diagnosis of invasion depth; the other two scores were not utilized. All training and inference were performed in a local environment using an Intel Core i9-12900K central processing unit and a GeForce RTX 3090 graphics processing unit.
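A minimal sketch of the second step, assuming the timm library for the pretrained backbone; the model tag, ReLU activation in the new head, and learning rate are our assumptions, not details from the paper:

```python
# Sketch of the second-step classifier (timm backbone choice is an assumption).
import torch
import torch.nn as nn
import timm

class DepthScorer(nn.Module):
    """EfficientNetV2-L backbone with the original fully connected layer
    removed and a new head (128 hidden nodes) that regresses the quality,
    noninvasion, and invasion scores simultaneously."""
    def __init__(self):
        super().__init__()
        # num_classes=0 strips the original classification layer.
        self.backbone = timm.create_model("tf_efficientnetv2_l",
                                          pretrained=True, num_classes=0)
        self.head = nn.Sequential(
            nn.Linear(self.backbone.num_features, 128),
            nn.ReLU(),
            nn.Linear(128, 3),  # quality, noninvasion, invasion (each 0-2)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))

model = DepthScorer()
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)  # lr illustrative

def rmse_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Root mean square error, the loss function named in the text.
    return torch.sqrt(nn.functional.mse_loss(pred, target))

def ensemble_predict(models, x):
    # Averaging over the five cross-validation models, as described above.
    with torch.no_grad():
        return torch.stack([m(x) for m in models]).mean(dim=0)
```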

Visualization of regions of interest (ROIs) for the AI model using class activation mapping (CAM)

To investigate the ROIs of the developed AI model, we performed visualization using CAM; specifically, we employed Eigen-CAM. We obtained the feature maps corresponding to the output of each class and weighted them by multiplying by the corresponding class output value. We obtained these maps for each of the 5 models in the ensemble and averaged them to create a visualization map for the input image. We implemented this using the pytorch-grad-cam library for PyTorch (https://github.com/jacobgil/pytorch-grad-cam).
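A minimal sketch of ensemble-averaged Eigen-CAM maps with the pytorch-grad-cam library referenced above; the target-layer choice is illustrative (it depends on the actual model definition), and the per-class weighting described in the text is omitted for brevity:

```python
# Sketch of ensemble Eigen-CAM visualization (target layer is illustrative).
import numpy as np
from pytorch_grad_cam import EigenCAM

def ensemble_eigencam(models, input_tensor) -> np.ndarray:
    """Compute one Eigen-CAM map per fold model and average them, as
    described above; `models` are assumed to be DepthScorer-like networks
    with a timm `backbone` attribute (our assumption)."""
    maps = []
    for m in models:
        cam = EigenCAM(model=m, target_layers=[m.backbone.blocks[-1]])
        maps.append(cam(input_tensor=input_tensor))  # (batch, H, W) array
    return np.mean(maps, axis=0)
```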

Training of CycleGAN model

We addressed the domain shift problem of the external validation dataset by using CycleGAN [29]. As the training dataset for CycleGAN, we used all EUS images derived from the EU-M2000 system (Olympus) in the development and internal validation datasets, together with all EUS images derived from the EU-ME1 and EU-ME2 systems (Olympus) in the external validation dataset. We trained the model for a total of 30 epochs, with each epoch covering the full set of images. We implemented this using the pytorch-CycleGAN-and-pix2pix repository (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix).
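Training used the referenced repository rather than custom code; as background, the core of CycleGAN is a cycle-consistency objective for unpaired domains, sketched below (the placeholder generators G_AB and G_BA between the EU-M2000 and EU-ME1/ME2 image domains are ours; the weight of 10.0 matches the repository's default):

```python
# Background sketch of CycleGAN's cycle-consistency loss, not code from the
# study; G_AB and G_BA are hypothetical generators between the two domains.
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G_AB, G_BA, real_A, real_B, lam=10.0):
    # Translating A -> B -> A (and B -> A -> B) should reconstruct the input.
    rec_A = G_BA(G_AB(real_A))
    rec_B = G_AB(G_BA(real_B))
    return lam * (l1(rec_A, real_A) + l1(rec_B, real_B))
```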

Outcome measures

The primary outcome was the diagnostic performance of the developed AI system for the per-lesion classification of “M-SM1” versus “SM2 or deeper.” As a secondary outcome, we compared the diagnostic performance of the AI system with that of gastroenterologists. In the internal validation dataset, we compared the diagnostic abilities of the AI system, six expert gastroenterologists, and eight nonexpert gastroenterologists. Expert gastroenterologists were those who met all of the following criteria: (1) more than 10 years of experience in gastrointestinal endoscopy, (2) experience with more than 30 cases of EUS for EGC, and (3) board certification as a fellow of the Japan Gastroenterological Endoscopy Society. Nonexpert gastroenterologists were those who did not meet at least one of these requirements. For internal validation, both expert and nonexpert gastroenterologists reviewed only the EUS images of each lesion and classified each lesion as either “M-SM1” or “SM2 or deeper.” When the diagnosis differed between images, the diagnosis was based on the image that appeared to reflect the deepest area of the lesion. For external validation, real-time EUS diagnoses made by expert gastroenterologists at each institution were used.

In the AI system, inference was performed on all images of each lesion using the developed model, and the maximum invasion score across those images was taken as the score for the lesion. The diagnosis of “M-SM1” or “SM2 or deeper” was determined by whether this score exceeded a threshold value. We calculated the diagnostic performance at every possible value of the invasion score and adopted the point closest to the performance of the experts as the threshold, as sketched below.
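The paper does not specify how “closest to the performance of the experts” is measured; the sketch below assumes Euclidean distance in (sensitivity, specificity) space, and all names are hypothetical:

```python
# Illustrative sketch of the lesion-level decision rule and threshold choice.
import numpy as np

def lesion_diagnosis(image_scores: np.ndarray, threshold: float) -> str:
    # The lesion score is the maximum invasion score over its images.
    lesion_score = image_scores.max()
    return "SM2 or deeper" if lesion_score > threshold else "M-SM1"

def choose_threshold(lesion_scores, labels, expert_sens, expert_spec):
    """Scan candidate thresholds and keep the one whose (sensitivity,
    specificity) lies closest to the experts' operating point; `labels`
    is a boolean array (True = SM2 or deeper on histology)."""
    best, best_dist = None, np.inf
    for t in np.unique(lesion_scores):
        pred = lesion_scores > t
        sens = (pred & labels).sum() / labels.sum()
        spec = (~pred & ~labels).sum() / (~labels).sum()
        dist = np.hypot(sens - expert_sens, spec - expert_spec)
        if dist < best_dist:
            best, best_dist = t, dist
    return best
```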

Statistical analysis

We compared the performance of the AI system and gastroenterologists by calculating the accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), together with their 95% confidence intervals (CIs). Pearson’s chi-square test and the McNemar test were used to compare diagnostic performance among evaluators. The receiver operating characteristic (ROC) curve and area under the curve (AUC), computed in Python, were used to represent the classification performance of our model. A p value less than 0.05 was considered statistically significant. All other statistical analyses were performed using JMP Pro version 16 (SAS Institute, Inc., Cary, NC, USA) and R version 4.2.1 (The R Foundation for Statistical Computing, Vienna, Austria).
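For illustration, the reported per-lesion metrics and their 95% CIs could be computed in Python as follows; the function names, the scikit-learn/statsmodels choices, and the Wilson CI method are our assumptions (the paper computed its statistics in JMP and R):

```python
# Sketch of the reported diagnostic metrics with 95% CIs (method assumed).
from sklearn.metrics import confusion_matrix, roc_auc_score
from statsmodels.stats.proportion import proportion_confint

def diagnostic_metrics(y_true, y_pred):
    # Binary labels: 1 = "SM2 or deeper", 0 = "M-SM1".
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    counts = {
        "accuracy":    (tp + tn, tp + tn + fp + fn),
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "PPV":         (tp, tp + fp),
        "NPV":         (tn, tn + fn),
    }
    out = {}
    for name, (num, den) in counts.items():
        lo, hi = proportion_confint(num, den, alpha=0.05, method="wilson")
        out[name] = (num / den, (lo, hi))  # point estimate and 95% CI
    return out

def auc_from_scores(y_true, lesion_scores):
    # ROC AUC from per-lesion invasion scores against histological depth.
    return roc_auc_score(y_true, lesion_scores)
```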
