Ultrasound imaging has become an essential tool in clinical diagnostics due to its affordability, portability, safety (being radiation-free), real-time capability, and non-invasive nature. It is particularly prevalent in the examination of superficial organs such as the breast (Spak et al., 2017) and thyroid (Tessler et al., 2017). Beyond these, ultrasound is extensively used in a wide range of applications including ovarian evaluations (Wu et al., 2018), cardiac function assessments (Folland et al., 1979), prenatal fetal monitoring (He et al., 2021), carotid artery examinations (Stein et al., 2008), liver (Ferraioli and Monteiro, 2019) and kidney function tests (Mostbeck et al., 2001), and appendicitis diagnosis (Mostbeck et al., 2016), among others. Despite its widespread usage, traditional ultrasound image analysis models are often developed for specific tasks or organ-specific datasets. These single-purpose models lack flexibility and generalization, making it challenging to scale across the diverse range of ultrasound imaging scenarios encountered in clinical practice. Furthermore, deploying multiple single-purpose models increases computational and storage demands, limiting their practicality in real-world applications. This highlights the need for a unified, automated solution capable of handling various tasks and organs efficiently, thereby enhancing diagnostic accuracy and consistency.
Existing self-supervised ultrasound foundation models are often less efficient than their supervised counterparts, requiring vast amounts of data and significant computational resources. Furthermore, these models typically require extensive task-specific fine-tuning before deployment, whereas supervised pretrained models can often be applied directly. Although supervised pretraining is a relatively efficient approach, it still faces several challenges. Firstly, as shown in Fig. 1, traditional ultrasound image analysis models are usually trained on specific and limited datasets. These models are designed to handle a particular type of ultrasound analysis problem, which restricts their scope and prevents them from effectively handling the diverse ultrasound images encountered in clinical practice. Moreover, relying on single-purpose models means that multiple models must be deployed on a device, imposing storage and memory burdens. Secondly, developing an orchestration multi-organ ultrasound model requires confronting the complexity and diversity of ultrasound images: tumor appearance and anatomical structures vary significantly, and image scales can be inconsistent. These differences among ultrasound images from different body parts make unified processing difficult. Thirdly, many existing general-purpose medical image models encounter difficulties when fine-tuned on new data: fine-tuning often results in catastrophic forgetting, causing the model to lose previously acquired knowledge and leading to poor generalization. These challenges clearly indicate the need for innovative solutions to construct robust and versatile ultrasound image analysis models.
The challenges in developing an orchestration ultrasound image model are not unique and are also present in many other modalities and datasets. Numerous studies have proposed solutions to these challenges.
One significant approach is self-supervised learning, which involves designing an appropriate proxy task to learn patterns from large-scale unlabeled data and then transferring this knowledge to downstream tasks. For example, Wu et al. (2023) proposed BROW, a foundational model for whole-slide images based on self-distillation, which improves performance through multi-scale input and color augmentation, excelling in various downstream tasks. Li et al. (2023) developed D-LMBmap, an automatic deep learning pipeline for whole-brain neural circuit analysis in mice, utilizing video transformers and self-supervised learning to achieve efficient and accurate brain analysis. Wang et al. (2023b) introduced Endo-FM, a foundational model for endoscopy video analysis, achieving good performance in classification, segmentation, and detection tasks through large-scale self-supervised pre-training and a unique spatio-temporal matching strategy. Zhou et al. (2023) presented RETFound, a foundational model for retinal images based on self-supervised learning, which adapts effectively to various disease detection tasks, excelling in both ocular and systemic disease detection and prediction. Hua et al. (2023) proposed PathoDuet, a foundational model for pathology slide analysis covering H&E and IHC images, validating its efficiency and generalization capability in multiple downstream tasks through a new self-supervised learning framework and two proxy tasks. Wang et al. (2023d) introduced a self-supervised learning method based on Volume Fusion and Parallel Convolution and Transformer Network (PCT-Net) for pre-training 3D medical image segmentation, showing excellent performance in multiple downstream segmentation tasks. Jiao et al. (2024) proposed USFM, a universal ultrasound foundational model, by building a large-scale multi-organ, multi-center, multi-device ultrasound database and using a spatial-frequency dual-masking image modeling method for self-supervised pre-training, demonstrating good generality, performance, and label efficiency in various downstream tasks. Kang et al. (2023) introduced Deblurring MAE, a method that incorporates a deblurring task into masked autoencoder (MAE) (He et al., 2022) pre-training, enhancing the ability of MAE to recover details in ultrasound images, thereby improving its performance in ultrasound image recognition tasks. While self-supervised learning (SSL) offers the significant advantage of leveraging vast unlabeled datasets, thereby avoiding costly manual annotation, the features learned via proxy tasks often require substantial fine-tuning to effectively adapt to specific downstream medical tasks, particularly when domain shifts exist (Jiao et al., 2024). Conversely, supervised pre-training (SL) on diverse, labeled datasets like M2-US, despite its annotation cost, directly optimizes representations for the target tasks (e.g., segmentation, classification). For developing a unified framework like PerceptGuide, intended for robust performance across multiple known tasks and organs with potentially minimal downstream adaptation, this direct task alignment learned via SL can provide a more efficient pathway to deployment effectiveness. Furthermore, SSL and supervised approaches like ours are not mutually exclusive; future work could explore leveraging upstream SSL pre-training for rich feature extraction followed by supervised training with PerceptGuide to instill task-specific knowledge and prompt-based orchestration.
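To make the proxy-task idea concrete, the sketch below shows MAE-style random patch masking in PyTorch. It is a minimal illustration rather than the exact recipe of any of the cited works; the mask ratio, tensor shapes, and the encoder/decoder referenced in the comment are assumed placeholders.

```python
# MAE-style random patch masking (minimal illustration; the mask ratio, shapes, and the
# encoder/decoder mentioned below are assumed placeholders, not the cited works' recipes).
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; a decoder must reconstruct the rest."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)            # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]                # indices of visible patches
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx

patches = torch.randn(2, 196, 768)          # (batch, tokens, dim), e.g. 14x14 ViT patches
visible, keep_idx = random_mask(patches)
# Pre-training then supervises only the reconstruction of masked patches, conceptually:
# loss = ((decoder(encoder(visible), keep_idx) - patches) ** 2).mean()  # on masked positions
```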
According to Zhang and Metaxas (2023), data diversity is one of the key factors in training foundational models. Therefore, some researchers have conducted in-depth studies on dataset construction or generation. Huang et al. (2023a) introduced the A-Eval benchmark for evaluating the cross-dataset generalization capability of abdominal multi-organ segmentation models, emphasizing the importance of data diversity and model size for generalization. Wang et al. (2023c) proposed a new dataset and benchmark, MedFMC, for evaluating foundational models in medical image classification, covering a variety of real-world clinical tasks and medical image modalities, and validating through experiments the effectiveness and limitations of several foundational models in medical image classification. Ding et al. (2023) introduced SNOW, a large-scale synthetic pathology image dataset for breast cancer segmentation, demonstrating through quality validation that it is effective and competitive for model training and improves nuclear segmentation performance. These works show that training an effective foundational model requires sufficiently diverse data.
Regarding data, some researchers have proposed utilizing text data to construct multimodal visual-text models that combine text (reports) and images. Qin et al. (2022) explored how to use large-scale pre-trained visual language models (such as GLIP (Li et al., 2022)) for medical image understanding, improving performance in medical image detection and classification tasks through both manually designed and automatically generated medical prompts, showing that pre-trained Vision Language Models (VLMs) can be effectively transferred to the medical domain through prompt learning. Zhang et al. (2023) proposed the CITE framework, which enhances pathology image classification by injecting textual knowledge into foundation model adaptation through linking image and text embeddings, demonstrating strong extensibility and excellent performance under data-limited conditions. Training both the text encoder and the image encoder simultaneously in such models is challenging because of the significant differences between their feature spaces, and these models also face a data scarcity problem: paired medical image-text datasets are still being accumulated and remain far less abundant than natural image-text pairs. Incorporating structured semantic information can alleviate this issue to some extent, since it does not rely heavily on large volumes of paired data.
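As a rough illustration of how image and text embeddings can be linked, the following sketch implements a generic CLIP-style contrastive alignment loss. It is an assumption-laden simplification for intuition only and does not reproduce the specific mechanisms of GLIP or CITE; the embedding dimension and temperature are arbitrary choices.

```python
# Generic sketch of linking image and text embeddings with a contrastive objective
# (illustrative of the general VLM idea, not the exact GLIP or CITE formulation).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)        # (B, D) image features
    txt_emb = F.normalize(txt_emb, dim=-1)        # (B, D) paired text/report features
    logits = img_emb @ txt_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric cross-entropy: matched image-text pairs lie on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

img_emb = torch.randn(8, 512)   # dummy batch of image features
txt_emb = torch.randn(8, 512)   # dummy batch of paired report features
loss = contrastive_alignment_loss(img_emb, txt_emb)
```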
Unlike the approach of self-supervised pre-training followed by transfer to downstream tasks (Table 1), another significant approach for general models is to first construct a rich dataset and then conduct large-scale supervised pre-training, allowing the model to be used directly. Wang et al. (2023a) proposed SAM-Med3D, a model that modifies the Segment Anything Model (SAM) (Kirillov et al., 2023) with 3D positional encoding for 3D medical image segmentation, displaying excellent performance and generalization on multiple datasets. Cheng et al. (2023) fine-tuned SAM on a large-scale medical image dataset to obtain SAM-Med2D, significantly improving various medical image segmentation tasks and showing excellent performance and generalization in segmenting different anatomical structures, modalities, and organs. Lei et al. (2023) proposed MedLSAM, a fully automated SAM medical adaptation model, including MedLAM for 3D medical image localization and SAM for segmentation, reducing annotation workload and exhibiting good performance. Huang et al. (2023b) designed a series of scalable and transferable medical image segmentation models, STU-Net, demonstrating strong performance and transferability in different downstream tasks through supervised pre-training on a large-scale dataset. Lin et al. (2023) introduced SAMUS and its end-to-end version AutoSAMUS, general models based on SAM for ultrasound image segmentation, achieving better segmentation performance and automatic segmentation capability through improvements such as CNN branches, feature adapters, and positional adapters, as well as an automatic prompt generator (APG). Ma et al. (2024) proposed MedSAM, a foundational model for medical image segmentation, trained and fine-tuned on a large-scale medical image dataset, exhibiting superior performance and generalization in various segmentation tasks compared to existing models. While SAM (Kirillov et al., 2023) and its medical adaptations (e.g., SAM-Med2D (Cheng et al., 2023), MedSAM (Ma et al., 2024)) have demonstrated potential through spatial prompts (points, boxes), these approaches primarily provide localization cues and may lack the semantic richness required to address ultrasound-specific challenges. Ultrasound imaging is inherently complex due to significant noise, artifacts (e.g., acoustic shadowing and enhancement), and high variability in anatomical structures across patients and acquisition conditions.
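To illustrate what spatial prompts look like in practice, the snippet below sketches box-prompted inference with the public segment-anything package: the bounding box supplies only a localization cue, with no semantic information about the organ or lesion. The checkpoint path and input image are placeholders, and the snippet is an illustration rather than part of any cited pipeline.

```python
# Box-prompted SAM inference (illustrative; the checkpoint path is a placeholder and the
# image is a dummy array standing in for an ultrasound frame).
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # assumed local SAM weights
predictor = SamPredictor(sam)

image_rgb = np.zeros((512, 512, 3), dtype=np.uint8)             # placeholder HxWx3 frame
predictor.set_image(image_rgb)

# The box supplies only *where* to segment, not *what* the structure is (no semantics).
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 300, 300]),                         # x0, y0, x1, y1
    multimask_output=False,
)
```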
This approach is practical and can effectively avoid the design shortcomings of self-supervised learning. However, there is currently no general supervised large-scale pre-training method for the ultrasound modality that can simultaneously handle both classification and segmentation tasks. In multi-task learning, it is common to encounter inconsistencies between task objectives (i.e., conflicts between tasks). Google’s HydraNet (Mullapudi et al., 2018) addresses this issue by sharing underlying features while employing task-specific heads, effectively mitigating such conflicts. This can be seen as an implementation of “co-learning”, where a shared training mechanism enables multiple tasks to optimize their respective objectives simultaneously. However, in contrast to our approach, HydraNet has a notable limitation: it requires separate segmentation and classification decoders (seg/cls decoders) for different organs. For instance, it would need one set of seg/cls decoders for breast images and another for thyroid images. Our proposed framework exploits the flexibility of prompts: regardless of which organ an ultrasound image depicts, the same seg decoder or cls decoder can be reused through prompt adaptation. This design of network structure reuse makes our method more efficient and simpler than HydraNet. This paper follows this supervised, prompt-driven paradigm in its study of ultrasound data.
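The contrast between per-organ heads and a prompt-conditioned shared head can be sketched as follows. This is our schematic simplification, not the actual HydraNet or PerceptGuide code; the feature dimensions, organ names, and head structures are assumptions chosen only to expose the difference in how decoders are reused.

```python
# Schematic contrast: one head per organ vs. one prompt-conditioned head shared by all organs.
import torch
import torch.nn as nn

class PerOrganHeads(nn.Module):                    # HydraNet-style: a dedicated head per organ
    def __init__(self, feat_dim=256, organs=("breast", "thyroid"), n_cls=2):
        super().__init__()
        self.cls_heads = nn.ModuleDict({o: nn.Linear(feat_dim, n_cls) for o in organs})

    def forward(self, feats, organ):
        return self.cls_heads[organ](feats)        # adding an organ means adding a head

class PromptConditionedHead(nn.Module):            # prompt-adapted: one head reused across organs
    def __init__(self, feat_dim=256, prompt_dim=32, n_cls=2):
        super().__init__()
        self.head = nn.Linear(feat_dim + prompt_dim, n_cls)

    def forward(self, feats, prompt):
        return self.head(torch.cat([feats, prompt], dim=-1))  # the prompt selects the behavior

feats = torch.randn(4, 256)                        # shared backbone features (dummy)
prompt = torch.zeros(4, 32)                        # prompt embedding (dummy)
out_a = PerOrganHeads()(feats, organ="breast")
out_b = PromptConditionedHead()(feats, prompt)
```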
This work is a substantial extension of our previous conference paper UniUSNet (Lin et al., 2024). Specifically, it extends UniUSNet by introducing a hyper-perception module with prompts and an attention-matching downstream synchronization stage, expanding the dataset from 7 to 9 organs (9.7k to 33k images), and adding comprehensive evaluations (including comparisons with state-of-the-art methods, thorough ablations, and failure case analysis). To address the aforementioned challenges, this paper proposes PerceptGuide, an orchestration learning framework for ultrasound classification and segmentation that incorporates a hyper-perception module and leverages the characteristics of ultrasound prompts to enhance flexibility and tunability. Our contributions can be summarized as follows:
•Comprehensive Dataset Collection: We have compiled a large-scale public ultrasound dataset named M2-US (Multi-position Multi-task). This dataset includes ultrasound images of 9 different organs or body parts, covering a total of 16 datasets (6 classification and 10 segmentation datasets), with over 33,000 images.
•Prompt-Guided Hyper-Perception Orchestration Ultrasound Framework: We propose a general framework for ultrasound image processing that addresses multi-task and multi-organ ultrasound image classification and segmentation. The core of this framework is the hyper-perception module we developed, which introduces model prompts. This not only enhances the model’s flexibility but also incorporates prior knowledge. We designed four specific prompts based on the intrinsic properties of ultrasound images: Object, Task, Input, and Position (an illustrative sketch of how such prompts can be encoded follows this list).
•Attention-Matching Downstream Synchronization Stage: As an extension, we introduce a downstream synchronization training stage that effectively fine-tunes our hyper-perception module as an adapter for new data, thereby improving the model’s generalization capability.
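One simple way the four prompts could be encoded is to concatenate a one-hot code per prompt type into a single conditioning vector, as in the hypothetical sketch below. The vocabulary sizes are assumptions for illustration only; the actual hyper-perception module and its prompt handling are described in Section 2.

```python
# Hypothetical encoding of the four prompts (Object, Task, Input, Position) into one
# conditioning vector; vocabulary sizes are assumed, not the paper's actual configuration.
import torch
import torch.nn.functional as F

N_OBJECT, N_TASK, N_INPUT, N_POSITION = 9, 2, 2, 4   # e.g. 9 organs, seg/cls, etc. (assumed)

def build_prompt(object_id: int, task_id: int, input_id: int, position_id: int) -> torch.Tensor:
    """Concatenate one-hot codes for each prompt type into a single vector."""
    parts = [
        F.one_hot(torch.tensor(object_id), N_OBJECT),
        F.one_hot(torch.tensor(task_id), N_TASK),
        F.one_hot(torch.tensor(input_id), N_INPUT),
        F.one_hot(torch.tensor(position_id), N_POSITION),
    ]
    return torch.cat(parts).float()                  # fed to the prompt-conditioned modules

prompt = build_prompt(object_id=0, task_id=1, input_id=0, position_id=2)  # shape: (17,)
```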
The rest of the paper is organized as follows: In Section 2, we detail our methodology, including the categories of model prompts, the hyper-perception module, and the downstream synchronization stage. Section 3 covers our experimental protocol, where the M2-US dataset is described in detail, including data preprocessing, implementation details, and evaluation metrics. In Section 4, we present our experimental results and corresponding analysis and discussion. Finally, Section 5 concludes the whole paper.