Advances in hospital informatization have led to the proliferation of electronic medical records (EMRs), which increasingly meet the data requirements for building artificial intelligence (AI) models [1,2]. Patient outcome prediction models assist clinicians in assessing the severity of a patient's condition and evaluating the impact of various decisions on the patient's outcome, thereby supporting clinical decision-making [3,4]. However, one significant barrier to the widespread adoption of deep learning models is their limited performance [5]. Fusing multi-modal data is an effective strategy for enhancing model performance in clinical tasks [6]. EMRs comprehensively record patients' long-term, multi-modal medical information, including laboratory tests, surgical procedures, bioelectrical signals, various types of images, and textual reports [7,8,9]. Multi-modal EMR data provide a full range of information that helps AI models better understand a patient's disease state and predict disease trends [10,11]. However, owing to the sparse features and heterogeneity of EMR data, AI still faces challenges in effectively fusing multi-modal data [12]. Structured tabular data are the easiest clinical data to acquire and process, consisting mainly of demographic and laboratory test data. These data are simple in form but rich in information and serve as the most fundamental and effective data for multi-modal fusion [13]. Unstructured text, such as admission reports, medical records, and imaging reports, accounts for the largest proportion of EMR data [14]. Text encoders have reached an advanced level of development, and additional training enables them to handle medical text [15,16]. Against this backdrop, the fusion of tabular data and clinical text has become a focus of multi-modal fusion research. Researchers have fused these two modalities using methods such as graph neural networks, transformers, and model ensembling, and have applied them to tasks such as disease risk prediction and surgical outcome prediction [17,18,19]. Although these studies have been successful in their respective tasks, there remains room for further improvement.
While multi-modal data can reflect comprehensive patient information, these data are not static. Patients undergo multiple examinations during hospitalization, generating sequences of data that change over time. These data reflect the patient's disease progression and treatment course, providing richer information. Time-varying tabular data come primarily from laboratory tests, where repeated test results form a time series that tracks changes in the patient's condition. Textual modalities, such as imaging reports and daily ward round records, also accumulate during hospitalization, forming time-varying text sequences. Hidden Markov models, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers have all been applied to modelling time-varying data [20,21,22,23]. However, current studies on fusing the two modalities often treat only the time-varying tabular data as a time series [4,24,25,26], while time-varying text remains more difficult to fuse. To fuse two time-varying modalities, the time series of each modality can be encoded first and the two modality representation vectors integrated afterwards, which may weaken the interaction between modalities at the same time point [27]. Alternatively, the multi-modal information at each time point can be integrated first and the temporal information processed afterwards, which may weaken each modality's own contextual information [28]. By temporally encoding each modality's sequence before fusion, fusion can be performed per time unit without losing the internal temporal information of either modality, as illustrated in the sketch following this paragraph. Daily ward round records, long-text data generated for each day of a patient's hospitalization, contain a wealth of information about patient status and its changes. However, their use presents significant challenges because of two key characteristics: a long, dense timeline recorded daily throughout the hospitalization, and a large volume of text with low information density. Effective modelling of these data would expand the data sources available for clinical prediction models, improve performance, and facilitate the clinical application of deep learning models.
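To make the last option concrete, the following minimal PyTorch sketch (illustrative only, not the architecture proposed in this work) encodes each modality's daily sequence with its own LSTM and then fuses the two hidden states at every time step before a shared temporal encoder. All module names and dimensions (e.g., PerTimestepFusion, tab_dim, text_dim) are hypothetical, and pre-aligned daily time steps with pre-computed note embeddings are assumed.

# Illustrative sketch (not this paper's implementation): encode each modality's
# sequence with its own LSTM, then fuse the two hidden states at every time
# step, so per-time-point interaction is kept without discarding each
# modality's internal temporal context. Assumes pre-aligned daily steps and
# pre-computed note embeddings; all names and sizes are hypothetical.
import torch
import torch.nn as nn

class PerTimestepFusion(nn.Module):
    def __init__(self, tab_dim=32, text_dim=768, hidden=64):
        super().__init__()
        self.tab_lstm = nn.LSTM(tab_dim, hidden, batch_first=True)
        self.text_lstm = nn.LSTM(text_dim, hidden, batch_first=True)
        self.fuse = nn.Linear(2 * hidden, hidden)     # time-unit fusion
        self.temporal = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, tab_seq, text_seq):
        # tab_seq:  (batch, days, tab_dim)   daily lab-test features
        # text_seq: (batch, days, text_dim)  daily note embeddings
        tab_h, _ = self.tab_lstm(tab_seq)     # modality-specific temporal encoding
        text_h, _ = self.text_lstm(text_seq)
        fused = torch.tanh(self.fuse(torch.cat([tab_h, text_h], dim=-1)))
        out, _ = self.temporal(fused)         # temporal modelling of the fused sequence
        return out[:, -1]                     # last-step summary representation

x_tab = torch.randn(4, 10, 32)
x_txt = torch.randn(4, 10, 768)
print(PerTimestepFusion()(x_tab, x_txt).shape)  # torch.Size([4, 64])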
During an extended hospital stay, data from each time point can reflect the patient's condition. This phenomenon, in which the current state is affected by previous states over long intervals, is known as long-term dependency. However, the substantial amount of information contained in multi-modal medical data may be forgotten by a model over time. Capturing long-term dependencies in extended sequences is therefore a critical means of enhancing model performance. Long short-term memory (LSTM) networks and transformers are considered two types of models with good long-term memory: the former is particularly effective at capturing contextual information, while the latter excels at extracting long-sequence features [29,30]. Stacking an LSTM and a transformer can leverage the strengths of both architectures and improve performance [23,31,32], indicating that integrating the two models holds significant potential. Additionally, the temporal convolutional network (TCN) [33] is also proficient at capturing long-term dependencies but is more effective than the LSTM at extracting local sequence features, so its combination with the LSTM also offers promising potential [34,35].
To address the challenges of long time series and long text in multi-modal fusion, this study proposes a multi-modal temporal data fusion architecture, medical multi-modal fusion for long-term dependencies (MMF-LD). Time-varying and time-invariant, tabular and textual data are embedded as feature vectors according to their respective characteristics. The time-varying representations of each modality are encoded by LSTMs and fused at each time point, which preserves the interactions between modalities without sacrificing temporal information. In addition, the progressive multi-modal fusion (PMF) approach repeats the information interaction guided by the daily notes, thus preventing information loss. During this process, the long short-term storage memory (LSTsM) encodes the temporal information of the time-varying fused representations; it incorporates an attention mechanism to improve the gated units, thereby enhancing the ability to capture long-term dependencies. Finally, the time-varying and time-invariant fused representations are concatenated and fed into a TCN to complete the final fusion, as sketched below.
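As a rough illustration of the final fusion stage only, the sketch below concatenates a time-varying fused sequence with a broadcast time-invariant vector and passes the result through dilated causal convolutions standing in for the TCN [33]. The PMF and LSTsM modules are this paper's contributions and are not reproduced here; the broadcasting step and all names (FinalFusionTCN, CausalConvBlock) are assumptions made for illustration.

# Illustrative sketch only: one way the final fusion stage described above
# might look, with plain dilated causal convolutions standing in for the TCN.
# The LSTsM and PMF modules are not reproduced here; broadcasting the
# time-invariant vector over time before concatenation is an assumption,
# not the authors' stated design.
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation             # left-pad for causality
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))        # pad the past only
        return torch.relu(self.conv(x))

class FinalFusionTCN(nn.Module):
    def __init__(self, var_dim=64, static_dim=32, hidden=64):
        super().__init__()
        self.blocks = nn.Sequential(
            CausalConvBlock(var_dim + static_dim, hidden, dilation=1),
            CausalConvBlock(hidden, hidden, dilation=2),
            CausalConvBlock(hidden, hidden, dilation=4),
        )
        self.head = nn.Linear(hidden, 1)               # e.g. outcome probability

    def forward(self, fused_seq, static_vec):
        # fused_seq:  (batch, days, var_dim)  time-varying fused representation
        # static_vec: (batch, static_dim)     time-invariant fused representation
        static = static_vec.unsqueeze(1).expand(-1, fused_seq.size(1), -1)
        x = torch.cat([fused_seq, static], dim=-1).transpose(1, 2)
        h = self.blocks(x)[:, :, -1]                   # representation at the last causal step
        return torch.sigmoid(self.head(h))

print(FinalFusionTCN()(torch.randn(4, 10, 64), torch.randn(4, 32)).shape)  # torch.Size([4, 1])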