Introduction
In the Intensive Care Unit (ICU), where timely and precise decision-making is crucial, healthcare professionals integrate diverse multimodal data — including clinical observations, diagnostic assays, and imaging studies — to achieve accurate diagnoses [1] [2]. However, in clinical practice, even with the same disease, the importance of each modality often varies among patients [3]. For instance, chest X-rays may provide more direct diagnostic information for some patients, while for others, laboratory test results from clinical records may be more crucial. Recognizing these differences in modality importance among patients enables clinicians to make more tailored and effective medical decisions, thereby enhancing patient outcomes.
In recent years, multimodal fusion in deep learning has demonstrated significant potential for integrating diverse data sources to enhance clinical diagnosis and prediction processes [4] [5]. For instance, intermediate fusion techniques combine features from multiple sources to improve the diagnosis and pathogen prediction for pulmonary infections [6]. While simple methods such as concatenation, element-wise summation, and element-wise multiplication can occasionally outperform unimodal models [7] [8] [9], they are limited in capturing the complex relationships among heterogeneous modalities. Attention-based methods, on the other hand, compute and incorporate importance scores (attention scores) of multimodal features during aggregation, allowing them to learn both inter-modal and intra-modal relationships [10] [11]. In the ICU domain, more realistic temporal datasets have been developed, with certain methods addressing the integration challenges of partially paired, heterogeneous, and temporally asynchronous data [1] [3]. Despite these advancements, existing methods still fall short in addressing the variability in sample-wise modality importance, which can be influenced by confidence levels among different patients (i.e., samples).
To address the aforementioned issues, we propose a dynamic uncertainty-aware weighting strategy that adjusts the sample-wise importance of different modalities based on their confidence. Subsequently, an Expert Ensemble Fusion (EEF) Module incorporates self-attention mechanisms to integrate features from the modalities, combined with modality-specific FeedForward Networks (FFNs) to ensure that critical information from each modality is preserved during fusion. The uncertainty-aware weighting strategy and the EEF Module function independently but sequentially: the former enhances the input features, while the latter optimizes their integration. The method is applied to phenotype classification and mortality prediction tasks. Our key contributions are as follows: (1) We design an uncertainty-aware dynamic weighting strategy with an Expert Ensemble Fusion (EEF) Module for ICU clinical prediction tasks, which dynamically adjusts the weighting of each modality based on its confidence and integrates heterogeneous data sources by leveraging modality-specific experts to capture complementary information. (2) Our dynamic uncertainty-aware weighting strategy accounts for sample-wise data confidence, enabling the model to adaptively adjust the importance of different modalities based on their reliability for each prediction. (3) Extensive experiments on phenotype classification and mortality prediction tasks demonstrate strong results.
Fig. 1. (a) An overview of our uncertainty-aware dynamic fusion framework. The blue and yellow balls denote sample-wise confidence and loss, respectively. (b) The detailed structure of our Expert Ensemble Fusion (EEF) Module.
Method
The joint multimodal fusion framework is designed with an uncertainty-aware dynamic weighting strategy and an Expert Ensemble Fusion (EEF) Module, as shown in Fig. 1. $h_{ehr}$ and $h_{cxr}$ are extracted from the two pre-trained unimodal encoders for further processing by the proposed method. To dynamically assess the weighting of features from each modality, uncertainty is estimated using energy scores [12], grounded in provable dynamic fusion theory [13]. Finally, the feature representations of each modality, weighted by the confidence derived from the energy scores, are passed to the EEF Module for further fusion.
A. Uncertainty-Aware Dynamic Weighted Strategy
Inspired by the provable dynamic fusion theory [13], which elucidates the relationship between the generalization ability of dynamic fusion and the quality of uncertainty estimation, our objective is to obtain uncertainty-aware dynamic weights that satisfy the following relationship:
\begin{equation*}r\left( {{w_m},l\left( {{f_m}} \right)} \right) \leq 0\tag{1}\end{equation*}
where $r(\cdot,\cdot)$ denotes the correlation between the fusion weight $w_m$ of modality $m$ and its unimodal loss $l(f_m)$: samples on which a modality incurs a larger loss should receive a smaller weight.
We employ this uncertainty assessment to weight the features of each modality, which can be expressed as:
\begin{equation*}{w_m} = {\alpha _m}{u_m} + {\beta _m}\tag{2}\end{equation*}
where $u_m$ is the confidence of modality $m$ derived from its energy score (introduced below), and $\alpha_m$ and $\beta_m$ are modality-specific scaling and offset terms.
In our implementation, $m$ ranges over two modalities, Chest X-Ray (CXR) images and time-series Electronic Health Records (EHR):
\begin{equation*}m \in \{ cxr,ehr\} \tag{3}\end{equation*}
The energy score [12] is used for reliable uncertainty estimation; it links the Helmholtz free energy of a data point to its corresponding probability density. To compute the energy score, the output features of each pre-trained unimodal encoder $f_m$, excluding its classifier, are first extracted. The CXR features are then projected to match the dimensionality of the EHR features using a projection layer $\phi$:
\begin{equation*}{h_{ehr}} = {f_{ehr}}\left( {{x_{ehr}}} \right),{h_{cxr}} = \phi \left( {{f_{cxr}}\left( {{x_{cxr}}} \right)} \right)\tag{4}\end{equation*}
The energy score $E$ for modality $m$ given the input $h_m$ is defined as:
\begin{equation*}E\left( {{h_m}} \right) = - {{\mathcal{T}}_m}\cdot \log \sum\limits_{k = 1}^{K} {{e^{h_m^{(k)}/{{\mathcal{T}}_m}}}} \tag{5}\end{equation*}
where $\mathcal{T}_m$ is the temperature parameter and $K$ is the number of components of $h_m$.
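For illustration, a minimal PyTorch sketch of the projection in Eq. (4) and the energy score in Eq. (5) is given below; the feature dimensions, temperature value, and variable names are assumptions for this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EnergyConfidence(nn.Module):
    """Project CXR features to the EHR dimension and compute per-sample energy scores (Eq. 4-5)."""

    def __init__(self, dim_cxr=512, dim_ehr=256, temperature=1.0):
        super().__init__()
        self.proj = nn.Linear(dim_cxr, dim_ehr)   # projection layer phi in Eq. (4)
        self.T = temperature                      # temperature T_m in Eq. (5)

    def energy(self, h):
        # E(h) = -T * log sum_k exp(h^(k) / T), computed over the K feature components
        return -self.T * torch.logsumexp(h / self.T, dim=-1)

    def forward(self, h_ehr, feat_cxr):
        h_cxr = self.proj(feat_cxr)                             # Eq. (4)
        e_ehr, e_cxr = self.energy(h_ehr), self.energy(h_cxr)   # Eq. (5), shape (batch,)
        return h_cxr, e_ehr, e_cxr

# usage with random features for a batch of 4 samples
mod = EnergyConfidence()
h_cxr, e_ehr, e_cxr = mod(torch.randn(4, 256), torch.randn(4, 512))
```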
To fulfill the condition outlined in relationship (1), a sampling-based regularization approach is applied, following the method in [13]. This approach uses the sample-wise loss during training as supervisory information. Prior to loss computation, a unimodal linear classifier $g_m$ is appended to each sample feature $h_m^{(i)}$ to obtain the unimodal prediction at training step $t$:
\begin{equation*}\hat y_m^{(i)} = g_m^{{\theta _t}}\left( {h_m^{(i)}} \right)\tag{6}\end{equation*}
\begin{equation*}l_m^{(i)} = \frac{1}{T}\sum\limits_{t = {T_s}}^{{T_s} + T} \ell \left( {{y^{(i)}},\hat y_m^{(i)}} \right)\tag{7}\end{equation*}
where $\ell$ is the task loss and the sample-wise loss $l_m^{(i)}$ is averaged over $T$ training steps starting from step $T_s$.
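The per-sample supervisory losses of Eqs. (6)-(7) can be tracked, for example, with a running record per sample; the buffer structure, window length, and the use of per-label BCE below are assumptions of this sketch rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
from collections import defaultdict, deque

class SampleLossTracker:
    """Keep a window of the last T per-sample unimodal losses (Eq. 7)."""

    def __init__(self, window=5):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, sample_ids, logits, targets):
        # per-sample loss for a multi-label task such as phenotyping (Eq. 6)
        losses = F.binary_cross_entropy_with_logits(
            logits, targets, reduction="none").mean(dim=1)
        for sid, l in zip(sample_ids, losses.detach().cpu().tolist()):
            self.history[sid].append(l)

    def averaged(self, sample_ids):
        # l_m^(i): average over the stored window of training steps (Eq. 7)
        return torch.tensor([sum(self.history[s]) / len(self.history[s])
                             for s in sample_ids])

# usage
tracker = SampleLossTracker(window=5)
logits, targets = torch.randn(4, 25), torch.randint(0, 2, (4, 25)).float()
tracker.update(["a", "b", "c", "d"], logits, targets)
print(tracker.averaged(["a", "b", "c", "d"]))
```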
During training, the dynamic weighted process is regularized by learning the following relationship:
\begin{equation*}l_m^{(i)} \geq l_m^{(j)} \Leftrightarrow w_m^{(i)} \leq w_m^{(j)}\tag{8}\end{equation*}
Our regularization term can be calculated with margin ranking loss as:
\begin{equation*}{{\mathcal{L}}_{reg}} = \max \left( {0,d\left( {w_m^{(i)},w_m^{(j)}} \right)\left( {l_m^{(i)} - l_m^{(j)}} \right) + \left| {w_m^{(i)} - w_m^{(j)}} \right|} \right)\tag{9}\end{equation*}
\begin{equation*} d\left( {w_m^{(i)},w_m^{(j)}} \right) = \begin{cases} 1 & \text{if } w_m^{(i)} > w_m^{(j)}, \\ 0 & \text{if } w_m^{(i)} = w_m^{(j)}, \\ -1 & \text{otherwise}. \end{cases}\tag{10}\end{equation*}
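A minimal sketch of the regularization term, implementing Eqs. (9)-(10) as written over all sample pairs within a batch for one modality; forming pairs exhaustively in the batch is an assumption of this sketch.

```python
import torch

def ranking_regularizer(w, l):
    """Margin-ranking regularization of Eqs. (9)-(10).
    w and l are 1-D tensors of per-sample weights and averaged losses."""
    w_i, w_j = w.unsqueeze(1), w.unsqueeze(0)   # pairwise grids over the batch
    l_i, l_j = l.unsqueeze(1), l.unsqueeze(0)
    d = torch.sign(w_i - w_j)                   # Eq. (10): +1 / 0 / -1
    reg = torch.clamp(d * (l_i - l_j) + (w_i - w_j).abs(), min=0)  # Eq. (9)
    return reg.mean()

# usage: samples with higher loss should receive lower weight
w = torch.tensor([0.9, 0.2, 0.5], requires_grad=True)
l = torch.tensor([0.1, 0.8, 0.4])
print(ranking_regularizer(w, l))
```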
As training progresses, the confidence of each modality dynamically adjusts the feature weights, producing the enhanced feature representations $h_m^{\ast}$:
\begin{equation*}h_m^{\ast} = {w_m}\cdot {h_m}\tag{11}\end{equation*}
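A small sketch combining Eqs. (2) and (11); treating $\alpha_m$ and $\beta_m$ as learnable scalars and deriving the confidence $u_m$ from the negative energy score via a sigmoid are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ConfidenceWeighting(nn.Module):
    """w_m = alpha_m * u_m + beta_m (Eq. 2) and h*_m = w_m * h_m (Eq. 11)."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, h, energy):
        u = torch.sigmoid(-energy)          # higher confidence <-> lower energy (assumption)
        w = self.alpha * u + self.beta      # Eq. (2), shape (batch,)
        return w.unsqueeze(-1) * h, w       # Eq. (11): scale each sample's features

# usage
cw = ConfidenceWeighting()
h_star, w = cw(torch.randn(4, 256), torch.randn(4))
```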
B. Expert Ensemble Fusion (EEF) Module
We introduce an Expert Ensemble Fusion (EEF) Module, inspired by the Vision Transformer (ViT) architecture [14], which serves as a fusion module to integrate the enhanced features $h_m^{\ast}$. The enhanced representation of each modality is treated as a sequence of $n_p$ token embeddings:
\begin{equation*}h_m^{\ast} = \left[ {e_1^m,e_2^m, \ldots ,e_{{n_p}}^m} \right]\tag{12}\end{equation*}
To obtain a global representation, a learnable classification embedding $e_{cls}$ is prepended to the concatenated token sequence, and positional embeddings $E_{pos}$ are added, yielding the input $Z_0$ to the first fusion block:
\begin{equation*}\begin{aligned} {Z_0} = {} & \left( {{e_{cls}} + {E_{pos}}[0,:]} \right) \oplus \left( {\left\{ {e_j^{ehr}} \right\}_{j = 1}^{{n_p}} + {E_{pos}}[1:,:]} \right) \\ & \oplus \left( {\left\{ {e_j^{cxr}} \right\}_{j = 1}^{{n_p}} + {E_{pos}}[1:,:]} \right) \end{aligned}\tag{13}\end{equation*}
In our Expert Ensemble Fusion (EEF) module, the single multi-layer perceptron (MLP) block of the original ViT is replaced with two distinct MLP-based Feedforward Networks (FFNs), $\text{FFN}_{cxr}$ and $\text{FFN}_{ehr}$. Inspired by [15], these FFNs act as separate experts that extract complementary information from the CXR and EHR modalities, respectively. Specifically, $\text{FFN}_{ehr}$ processes the initial $n_p + 1$ vectors (the classification token and the EHR tokens), while $\text{FFN}_{cxr}$ handles the subsequent $n_p$ CXR tokens. The outputs of the two experts are concatenated to form the output $Z_l$ of block $l$. A linear classifier $g_{fusion}$ then maps the output $Z_L$ of the final layer to the fused prediction $\hat{y}_{fusion}$.
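A condensed PyTorch sketch of one EEF block is shown below. The token routing follows the text ($\text{FFN}_{ehr}$ takes the first $n_p+1$ tokens, $\text{FFN}_{cxr}$ the remaining $n_p$), while the hidden sizes, activation, and pre-norm placement are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class EEFBlock(nn.Module):
    """One Transformer block with shared self-attention and modality-specific expert FFNs."""

    def __init__(self, dim=256, heads=4, n_p=8):
        super().__init__()
        self.n_p = n_p
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn_ehr = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_cxr = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        # shared multi-head self-attention over [cls | ehr tokens | cxr tokens]
        zn = self.norm1(z)
        a, _ = self.attn(zn, zn, zn)
        z = z + a
        zn = self.norm2(z)
        # route the first n_p + 1 tokens (cls + ehr) and the last n_p tokens (cxr) to separate experts
        out = torch.cat([self.ffn_ehr(zn[:, : self.n_p + 1]),
                         self.ffn_cxr(zn[:, self.n_p + 1:])], dim=1)
        return z + out

# usage: batch of 2, n_p = 8 tokens per modality plus one cls token
z0 = torch.randn(2, 1 + 8 + 8, 256)
print(EEFBlock()(z0).shape)   # torch.Size([2, 17, 256])
```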
Finally, the total loss is defined as follows to optimize the module:
\begin{equation*}{{\mathcal{L}}_{overall}} = {\mathcal{L}}\left( {y,{{\hat y}_{cxr}}} \right) + {\mathcal{L}}\left( {y,{{\hat y}_{ehr}}} \right) + {\mathcal{L}}\left( {y,{{\hat y}_{fusion}}} \right) + {{\mathcal{L}}_{reg}}\tag{14}\end{equation*}
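A minimal sketch of combining the three prediction losses and the regularization term of Eq. (14); the choice of BCE-with-logits as the task loss $\mathcal{L}$ is an assumption for a multi-label task such as phenotyping.

```python
import torch
import torch.nn.functional as F

def overall_loss(y, y_cxr, y_ehr, y_fusion, l_reg):
    """L_overall = L(y, y_cxr) + L(y, y_ehr) + L(y, y_fusion) + L_reg  (Eq. 14)."""
    bce = F.binary_cross_entropy_with_logits
    return bce(y_cxr, y) + bce(y_ehr, y) + bce(y_fusion, y) + l_reg

# usage with random logits for the 25-label phenotyping task
y = torch.randint(0, 2, (4, 25)).float()
logits = [torch.randn(4, 25) for _ in range(3)]
print(overall_loss(y, *logits, l_reg=torch.tensor(0.1)))
```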
Experiments
A. Datasets and Benchmark Tasks
The experimental data are drawn from MIMIC-IV [19] and MIMIC-CXR [20]. MIMIC-IV contains clinical time-series EHR data from ICU wards [21], and MIMIC-CXR contains chest X-ray images. To ensure a fair comparison and demonstrate the efficacy of our method, we follow the procedure outlined in [1] for data pre-processing and task definition. The tasks and their respective dataset details are outlined below:
Phenotype classification aims to predict whether each of 25 chronic, mixed, and acute care conditions is present during a patient's ICU stay (as detailed in Table X). The dataset comprises 47,069 cases, of which 10,804 have paired EHR and CXR data, while the remaining cases lack CXR data.
In-hospital mortality prediction aims to estimate the probability of a patient's death during the hospital stay, based on data from the first 48 hours after ICU admission. The dataset for this task includes 20,797 cases, of which 6,215 have paired data.
For both tasks, the EHR data comprise 17 clinical variables, while the CXR data consist of the last chest X-ray image collected during the ICU stay. All data were randomly split into training, validation, and test sets in a 70%/10%/20% ratio, with both paired and unpaired cases included in each split. We refer to the entire set of paired and unpaired data as the full dataset and to all paired data as the paired subset. The evaluation metrics for each task are the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC).
B. Implementation Details
Our method is implemented with the PyTorch framework and runs on an NVIDIA GeForce RTX 3090 Ti GPU (24 GB).
Following [1], the encoders are pre-trained as a two-layer stacked LSTM [22] for the clinical time-series data and a ResNet34 [23] for the chest X-ray images. We use 2 blocks in the Expert Ensemble Fusion (EEF) module (Fig. 1b), and the number of heads in the Multi-Head Self-Attention is set to 4. All training uses the Adam optimizer [24] with an initial learning rate of 0.00005 and a batch size of 16; a learning-rate scheduler reduces the learning rate when the validation AUROC does not improve for 15 epochs. The final test-set results are reported with 95% confidence intervals estimated through 1,000 bootstrap [25] iterations.
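The optimizer and scheduling setup described above could be reproduced roughly as follows; the specific scheduler class (ReduceLROnPlateau on validation AUROC) and the reduction factor are assumptions consistent with the text, and the model and validation metric are placeholders.

```python
import torch

model = torch.nn.Linear(256, 25)   # placeholder for the full fusion model

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
# reduce the learning rate when validation AUROC (mode="max") plateaus for 15 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=15)

for epoch in range(100):
    # ... training loop with batch size 16 ...
    val_auroc = 0.5   # placeholder: computed on the validation set each epoch
    scheduler.step(val_auroc)
```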
C. Comparison with the State-of-the-art Methods
As detailed in Table I, the proposed method is compared with several state-of-the-art approaches, including Early Fusion [16], Joint Fusion [1], MMTM [17], DAFT [18], and MedFuse [1], across various dataset settings. To highlight the superiority of the proposed fusion method, all experiments use the same pre-training strategy as MedFuse [1] to obtain the modality-specific encoders, given that the pre-trained models of MedFuse are not publicly available. Except for MMTM and DAFT, all methods were trained on the full dataset.
The proposed method demonstrates superior performance in all experiments for phenotyping and mortality prediction. Among the methods with pre-trained unimodal encoders, ours outperforms the other two models (i.e., Early Fusion [16] and MedFuse [1]) by a large margin on the paired subset for the mortality prediction task. When tested on the full dataset, which closely reflects real-world clinical conditions, our approach achieves the best performance across both tasks, with an AUROC of 0.762 and an AUPRC of 0.420 for phenotype classification, and an AUROC of 0.864 and an AUPRC of 0.519 for in-hospital mortality prediction. These results illustrate the effectiveness of incorporating sample-wise data confidence and a tailored fusion module, which significantly improves the accuracy of multimodal clinical prediction tasks.
D. Ablation Study
To assess the impact of our proposed Uncertainty-Aware dynamic Weighting strategy (UAW) and the Expert Ensemble Fusion (EEF) modules, we conducted a series of ablation experiments on the mortality prediction task with both the paired subset and full dataset. The results, presented in Table II, demonstrate that the UAW strategy and EEF module both enhance prediction performance across both test sets.
Conclusion
In this paper, a novel dynamic fusion framework is proposed for multimodal clinical prediction tasks involving time-series clinical data and chest X-ray images in the ICU context. The framework incorporates an uncertainty-aware dynamic weighting strategy that assigns greater sample-wise importance to modality features with higher confidence, and then feeds these enhanced features into a tailored fusion module to uncover intrinsic relationships between heterogeneous modalities. Experimental results demonstrate favorable performance across various data conditions.