Medical image segmentation is the process of partitioning images, such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scans, into segments or regions that delineate and emphasize specific anatomical structures or areas of interest. This critical step in medical image analysis plays a significant role in enhancing diagnostic precision, aiding treatment planning, and facilitating effective patient management (Antonelli et al., 2022). Advanced techniques predominantly employ fully or partially supervised models (Fan et al., 2024, Isensee et al., 2021) that capitalize on prior information derived from paired image and label data, often matching or even exceeding the accuracy of expert radiologists. However, these methods assume the availability of comprehensive annotation sets for training, a substantial limitation in the medical field, where time, cost, and expertise are significant constraints.
Self-supervised and few-shot medical image segmentation methods (Wang et al., 2020a, Wang et al., 2019, Liu et al., 2023a) are being explored as alternatives that address these data-availability limitations, aiming to reduce the dependence on comprehensive, well-representative annotation sets. Self-supervised methods typically require large amounts of training data to prevent overfitting (Snell et al., 2017, Wang et al., 2019) and still involve fine-tuning with a substantial amount of labeled data (Grill et al., 2020, Wang et al., 2020b). Few-shot segmentation methods use support images to learn meaningful prototypical representations for each label (Butoi et al., 2023, Cheng et al., 2024) or use prompts to incorporate prior knowledge for accurate and generalized feature representation, enabling the model to segment unseen data (Ma et al., 2024, Wang et al., 2023a). These methods learn representations for only one label at a time, incurring extensive manual intervention or computational overhead when multiple labels must be processed simultaneously, even for 2D image slices (Butoi et al., 2023). This limitation hinders their application to multi-organ segmentation, especially when neighboring target anatomical structures, such as brain tissues, are highly curved and folded.
Another effective family of few-shot segmentation methods employs deep atlas-based models such as VoxelMorph (Balakrishnan et al., 2019). These models register a reference image (atlas) to target unlabeled images and propagate the atlas labels to each target image using the estimated transformation. However, the unsupervised registration process and voxel intensity variations often result in limited spatial transformation performance and instability across datasets (Liu et al., 2023a). To overcome these limitations, some approaches leverage forward–backward consistency between atlas and target images (Wang et al., 2020a), or estimate bi-directional spatial transformations while enforcing inverse consistency (Zheng et al., 2022). Despite these efforts, aligning complex tissues (e.g., brain images) and regions with high variability (e.g., abdominal images) can lead to misalignment errors and distorted, unrealistic outputs from the registration network. To address this issue, several studies (Kang et al., 2022, Zhao et al., 2019b, Lv et al., 2022) propose multi-scale or cascaded Convolutional Neural Networks (CNNs) to decompose the target deformation field. While effective, these methods can blur the distinction between tissue types by reducing image contrast and sharpness; because they rely on image similarity alone, without anatomical guidance, they make it harder for segmentation algorithms to accurately delineate boundaries between anatomical structures.
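The label-propagation step shared by these atlas-based pipelines can be summarized in a few lines. The sketch below is a minimal illustration, assuming a dense voxel-unit displacement field has already been predicted by a registration network such as VoxelMorph; the function name and the use of SciPy are our own assumptions rather than any cited implementation.

```python
# Minimal sketch of atlas-based label propagation, assuming a dense
# displacement field `flow` (in voxel units, shape (3, D, H, W)) has already
# been predicted by a registration network such as VoxelMorph.
import numpy as np
from scipy.ndimage import map_coordinates

def propagate_labels(atlas_labels: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp an integer atlas label volume of shape (D, H, W) into target space."""
    # Identity sampling grid, shape (3, D, H, W).
    grid = np.stack(np.meshgrid(
        np.arange(atlas_labels.shape[0]),
        np.arange(atlas_labels.shape[1]),
        np.arange(atlas_labels.shape[2]),
        indexing="ij",
    ))
    # Each target voxel samples the atlas at its displaced coordinate.
    coords = grid + flow
    # Nearest-neighbour interpolation (order=0) keeps label values discrete.
    return map_coordinates(atlas_labels, coords, order=0, mode="nearest")
```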
In this paper, we develop an efficient and easily trainable end-to-end model, EFS-MedSeg, for accurate few-shot semantic segmentation of multiple target classes at once. Fig. 1 illustrates the superior segmentation performance of our proposed EFS-MedSeg model compared to the current state-of-the-art few-shot medical image segmentation techniques. The main contributions of this work are fourfold.
•We propose an efficient representation learning framework that jointly optimizes supervised segmentation with labeled atlases and self-supervised reconstruction with unlabeled images. This approach uses a 3D random regional switch strategy (see the sketch following this list) to increase training data variability, improving generalizability and feature discrimination.
•We introduce an adaptive attention mechanism that guides the model to balance volume discrepancies between tissues and to focus on areas with lower Dice similarity scores, preventing the model from overlooking small-volume tissues and enhancing its feature extraction and segmentation performance.
•We introduce a self-contrastive module that incorporates anatomical shape and structural priors. This module not only maintains appearance similarity with the atlas image but also efficiently identifies symmetric differences between labels, improving model convergence and yielding natural, smooth label boundaries.
•Experiments on two MRI datasets and one CT dataset show that our method converges faster and consistently outperforms existing state-of-the-art methods on all three few-shot segmentation tasks. Results on an additional CT dataset with spinal pathologies highlight our method’s strong potential for addressing complex segmentation tasks with minimal supervision.
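To make the 3D random regional switch mentioned in the first contribution more concrete, the following sketch swaps one randomly placed 3D region between a labeled atlas volume and an unlabeled target volume, in the spirit of a 3D CutMix; the patch size, sampling scheme, and function name are illustrative assumptions rather than the exact strategy used in EFS-MedSeg.

```python
# Minimal sketch of a 3D random regional switch between a labeled atlas volume
# and an unlabeled target volume (a 3D CutMix-style patch swap). Patch size,
# sampling scheme, and function name are illustrative assumptions only.
import numpy as np

def regional_switch(atlas: np.ndarray, target: np.ndarray,
                    patch=(32, 32, 32), rng=None):
    """Swap one randomly located 3D region between two volumes of equal shape (D, H, W)."""
    rng = rng or np.random.default_rng()
    # Random corner of the region, kept fully inside the volume.
    z, y, x = (int(rng.integers(0, s - p + 1)) for s, p in zip(atlas.shape, patch))
    region = (slice(z, z + patch[0]), slice(y, y + patch[1]), slice(x, x + patch[2]))
    mixed_atlas, mixed_target = atlas.copy(), target.copy()
    # Exchange the region so each mixed volume contains content from both sources.
    mixed_atlas[region], mixed_target[region] = target[region], atlas[region]
    return mixed_atlas, mixed_target, region
```

The returned region indices would presumably also be applied to the atlas label map, so that the supervised segmentation loss remains consistent with the mixed atlas image.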