Learning contrast and content representations for synthesizing magnetic resonance images of arbitrary contrast

Magnetic Resonance Imaging (MRI) is an indispensable tool in modern medical diagnosis and intervention, offering non-invasive and radiation-free imaging with superior soft-tissue contrast. By employing various pulse sequences during the imaging process, MRI can produce images with different contrasts of the same anatomy. In clinical practice, multiple MRI sequences are often acquired to provide a comprehensive assessment of pathological conditions, enabling precise diagnosis and treatment planning.

However, it is common for a subject to have one or more missing sequences or modalities. This can be attributed to several factors, such as the need to shorten the examination duration, technical issues with hardware or software, or patient motion during scanning that introduces image artifacts and degrades the quality of the acquired sequences. Additionally, patient-specific factors, including claustrophobia or certain medical conditions, may hinder the successful acquisition of specific MRI sequences. Some patients may experience anxiety or discomfort inside the MRI scanner, making it challenging to complete the entire imaging protocol.

Given the prevalence of incomplete data, the synthesis of missing sequences has become a topic of great interest among researchers (Choy et al., 2018, Nensa et al., 2019). MR image synthesis aims to generate missing target sequences using the available source sequences, which can be viewed as a type of cross-domain image translation. The objective is to transform the information captured in the source images into a representation that resembles the targets by learning the underlying relationship between them. This allows for the generation of realistic target sequences that correspond to the sources, enabling clinicians and researchers to obtain a more comprehensive understanding of the anatomical structures or pathologies being studied. Furthermore, synthesizing missing sequences is beneficial for various automatic multi-modal image analysis tasks, which often require all sequences to be available as inputs (Zhao et al., 2020). In summary, MRI sequence synthesis offers multiple advantages. It can potentially reduce scan time and patient discomfort, address acquisition failures without additional appointments, provide alternatives to contrast agents for at-risk patients, enable research with legacy datasets, and facilitate standardization across different acquisition protocols for multi-center studies.

The advent of deep learning has revolutionized MR image synthesis, yielding remarkable results by leveraging powerful neural network architectures and large-scale datasets (Nie et al., 2018, Lee et al., 2019, Yu et al., 2019, Wang et al., 2020a, Wang et al., 2020b, Huang et al., 2022, Zhang et al., 2024, Peng et al., 2024). Most deep learning methods for image synthesis map a single input sequence to a single output in a one-to-one setting. This setup is fixed, requiring a specialized model for each input–output combination, which is inefficient when adapting to new tasks. Enabling synthesis from new input combinations without retraining is therefore desirable for flexible and efficient MR image synthesis. An ideal framework would incorporate multiple inputs into a unified model and dynamically fuse relevant information from all available inputs to synthesize diverse outputs, regardless of whether a particular input is present at inference time. Methods such as MM-GAN (Sharma and Hamarneh, 2019), ResViT (Dalmaz et al., 2022), and MMT (Liu et al., 2023) have greatly enhanced the flexibility of deploying MR image synthesis models.

However, these methods typically assume that all input–output combinations are seen during network training, often implemented via multi-task learning. They employ shared encoders to extract features from multi-contrast inputs, and by fusing the encoded representations of the available inputs, the networks can flexibly infer targets from varying collections of input images. Nevertheless, modeling the mapping between different image contrasts is non-trivial due to substantial domain gaps: the networks must be robust and expressive enough to capture representations of multi-contrast MR images while implicitly treating all sequences or modalities in the same way. Moreover, a trained network may only be able to map among contrasts seen at least once in the training data, making generalization to new contrasts difficult.
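This shared-encoder design can be sketched in a few lines. The following is a minimal PyTorch illustration, assuming single-channel 2D slices and simple mean fusion; both are simplifying assumptions, and models such as MM-GAN, ResViT, and MMT use considerably more elaborate architectures and fusion schemes:

```python
import torch
import torch.nn as nn

class SharedEncoderFusion(nn.Module):
    """Illustrative shared-encoder model: every available input contrast
    passes through the same encoder, and the features are fused before
    decoding the target. Layer sizes are placeholders."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, available_inputs: list[torch.Tensor]) -> torch.Tensor:
        # Encode each available contrast with the *same* encoder weights.
        feats = [self.encoder(x) for x in available_inputs]
        # Mean fusion keeps the model agnostic to how many inputs
        # arrive at inference time.
        fused = torch.stack(feats, dim=0).mean(dim=0)
        return self.decoder(fused)
```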

To address these limitations, we propose a novel method to learn Contrast and Content Representations (CCR) for synthesizing MR images across diverse sequences. Our rationale is to explicitly separate the learning of contrast and content when modeling multi-contrast images in the latent space. Specifically, we train an encoder to process all multi-contrast images and simultaneously learn a sequence-specific contrast representation, guided by text prompts describing imaging protocols or parameters. Through a linear operation, we can remove the contrast information from the latent representation of the input source image, leaving only the content representation that captures the anatomical information. By recombining a target contrast representation with this latent content representation, we synthesize an output image whose contrast corresponds to the desired target sequence.
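A minimal sketch of this decompose-and-recombine step is shown below, assuming the contrast representation is a per-channel vector removed by simple subtraction and that the text prompt arrives as a fixed-size embedding; all module definitions, names, and shapes are illustrative rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class CCRSketch(nn.Module):
    """Toy version of contrast/content decomposition: subtract the source
    contrast code from the latent to expose content, then add the target
    contrast code and decode."""

    def __init__(self, channels: int = 64, prompt_dim: int = 32):
        super().__init__()
        self.encoder = nn.Conv2d(1, channels, 3, padding=1)
        self.decoder = nn.Conv2d(channels, 1, 3, padding=1)
        # Maps a prompt embedding (imaging protocol/parameters)
        # to a per-channel contrast code.
        self.contrast_encoder = nn.Linear(prompt_dim, channels)

    def forward(self, source, src_prompt, tgt_prompt):
        latent = self.encoder(source)               # (B, C, H, W)
        c_src = self.contrast_encoder(src_prompt)   # (B, C)
        c_tgt = self.contrast_encoder(tgt_prompt)   # (B, C)
        # Linear removal of the source contrast leaves the content ...
        content = latent - c_src[..., None, None]
        # ... and injecting the target contrast yields the target latent.
        return self.decoder(content + c_tgt[..., None, None])
```

Because the separation is linear in this sketch, contrast codes can be swapped by simple latent-space arithmetic, which is what makes the recombination step cheap and composable.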

The success of CCR relies on modeling contrast and content separately in the latent space, enabling their decomposition and recombination. The content representation reflects the anatomical information shared by all sequences of the same subject, while the contrast representation encodes how this anatomy is expressed in the intensities of each sequence. To achieve a consistent and efficient representation of the anatomical information, we impose two key constraints: (1) a channel-wise orthogonality constraint on the content representation in the latent space, which encourages each channel to encode unique, non-redundant anatomical information and thus enhances the representation capacity; and (2) a spatial alignment constraint that enforces consistent encoding of anatomical structures across multiple sequences of the same subject, thereby promoting robustness to different input contrasts.
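Both constraints admit simple loss formulations. The sketch below uses a Gram-matrix penalty for channel-wise orthogonality and a voxel-wise consistency term for spatial alignment; these are plausible instantiations under our reading, not necessarily the paper's exact losses:

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(content: torch.Tensor) -> torch.Tensor:
    """Penalize off-diagonal entries of the channel Gram matrix so that
    each channel encodes non-redundant anatomical information."""
    b, c, _, _ = content.shape
    flat = F.normalize(content.reshape(b, c, -1), dim=-1)  # unit-norm channels
    gram = flat @ flat.transpose(1, 2)                     # (B, C, C)
    off_diag = gram - torch.eye(c, device=content.device)  # diagonal is 1 after normalization
    return off_diag.pow(2).mean()

def alignment_loss(content_a: torch.Tensor, content_b: torch.Tensor) -> torch.Tensor:
    """Content maps of two co-registered sequences of the same subject
    should agree voxel-wise."""
    return F.mse_loss(content_a, content_b)
```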

The CCR framework enables us to flexibly disentangle contrast and content, facilitating various image synthesis tasks with high fidelity. First, we naturally support synthesizing targets from multiple inputs, as multi-contrast input images can be efficiently fused in the latent space once their individual contrast representations are separated from the shared content representation. Second, our model can potentially generate images of new contrasts by assigning imaging parameters unseen during training as the inference target. This zero-shot generalization capability distinguishes CCR from many existing methods and opens up new possibilities for synthesizing MR images with novel contrast settings.
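Continuing the CCRSketch example above, a hypothetical zero-shot use could look as follows: contents extracted from two available sequences are fused in the latent space and decoded with a contrast code derived from imaging parameters never seen during training. Mean fusion and all tensor names here are assumptions for illustration:

```python
import torch

model = CCRSketch()  # defined in the earlier sketch
t1, t2 = torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64)
p_t1, p_t2, p_new = (torch.randn(1, 32) for _ in range(3))  # prompt embeddings

c_t1 = model.contrast_encoder(p_t1)[..., None, None]
c_t2 = model.contrast_encoder(p_t2)[..., None, None]
c_new = model.contrast_encoder(p_new)[..., None, None]  # unseen target parameters

# Remove each input's contrast, average the shared anatomical content,
# then decode with the new contrast code.
content = ((model.encoder(t1) - c_t1) + (model.encoder(t2) - c_t2)) / 2
new_image = model.decoder(content + c_new)
```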

The main contributions of this work are summarized as follows:

We propose CCR, a novel framework that explicitly models contrast and content representations for flexible and efficient MR image synthesis.

Our contrast encoder learns sequence-specific contrast representations initialized from imaging parameters, enabling zero-shot generation of unseen MR image contrasts.

By imposing constraints on the encoded content, we construct a content representation that demonstrates high fidelity and interpretability for MR image synthesis.

The rest of this paper is organized as follows. In Section 2, we review the related work in the literature. Section 3 provides a detailed description of our proposed CCR method. We conduct experimental validation on both in-house and public datasets in Section 4. Finally, we conclude this paper with discussions in Section 5.
