Introduction
In the Intensive Care Unit (ICU), where timely and precise decision-making is crucial, healthcare professionals integrate diverse multimodal data — including clinical observations, diagnostic assays, and imaging studies — to achieve accurate diagnoses [1] [2]. However, in clinical practice, even with the same disease, the importance of each modality often varies among patients [3]. For instance, chest X-rays may provide more direct diagnostic information for some patients, while for others, laboratory test results from clinical records may be more crucial. Recognizing these differences in modality importance among patients enables clinicians to make more tailored and effective medical decisions, thereby enhancing patient outcomes.
In recent years, multimodal fusion in deep learning has demonstrated significant potential for integrating diverse data sources to enhance clinical diagnosis and prediction processes [4] [5]. For instance, intermediate fusion techniques combine features from multiple sources to improve the diagnosis and pathogen prediction for pulmonary infections [6]. While simple methods such as concatenation, element-wise summation, and element-wise multiplication can occasionally outperform unimodal models [7] [8] [9], they are limited in capturing the complex relationships among heterogeneous modalities. Attention-based methods, on the other hand, compute and incorporate importance scores (attention scores) of multimodal features during aggregation, allowing them to learn both inter-modal and intra-modal relationships [10] [11]. In the ICU domain, more realistic temporal datasets have been developed, with certain methods addressing the integration challenges of partially paired, heterogeneous, and temporally asynchronous data [1] [3]. Despite these advancements, existing methods still fall short in addressing the variability in sample-wise modality importance, which can be influenced by confidence levels among different patients (i.e., samples).
To address the aforementioned issues, we propose a dynamic uncertainty-aware weighting strategy that adjusts the sample-wise importance of different modalities based on their confidence. Subsequently, an Expert Ensemble Fusion (EEF) Module incorporates self-attention mechanisms to integrate features from the modalities, combined with modality-specific FeedForward Networks (FFNs) to ensure that critical information from each modality is preserved during fusion. The uncertainty-aware weighting strategy and the EEF Module function independently but sequentially: the former enhances the input features, while the latter optimizes their integration. The method is applied to phenotype classification and mortality prediction tasks. Our key contributions are as follows: (1) We design an uncertainty-aware dynamic weighting strategy with an Expert Ensemble Fusion (EEF) Module for ICU clinical prediction tasks, which dynamically adjusts the weighting of each modality based on its confidence and integrates heterogeneous data sources by leveraging modality-specific experts to capture complementary information. (2) Our dynamic uncertainty-aware weighting strategy accounts for sample-wise data confidence, enabling the model to adaptively adjust the importance of different modalities based on their reliability for each prediction. (3) Extensive experiments on phenotype classification and mortality prediction tasks demonstrate strong results.
Fig. 1. (a) An overview of our uncertainty-aware dynamic fusion framework. The blue and yellow balls denote sample-wise confidence and loss, respectively. (b) The detailed structure of our Expert Ensemble Fusion (EEF) Module.
Method
The joint multimodal fusion framework is designed with an uncertainty-aware dynamic weighting strategy and an Expert Ensemble Fusion (EEF) Module, as shown in Fig. 1. $h_{ehr}$ and $h_{cxr}$ are extracted from the two pre-trained unimodal encoders for further processing by the proposed method. To dynamically assess the weighting of features from each modality, uncertainty is estimated using energy scores [12], grounded in provable dynamic fusion theory [13]. Finally, the feature representations of each modality, weighted by the confidence derived from the energy scores, are passed to the EEF Module for further fusion.
A. Uncertainty-Aware Dynamic Weighted Strategy
Inspired by the provable dynamic fusion theory [13], which elucidates the relationship between the generalization ability of dynamic fusion and the quality of uncertainty estimation, our objective is to obtain uncertainty-aware dynamic weights that satisfy the following relationship:
\begin{equation*}r\left( {{w_m},l\left( {{f_m}} \right)} \right) \leq 0\tag{1}\end{equation*}
where $r(\cdot,\cdot)$ denotes the correlation between the fusion weight $w_m$ of modality $m$ and its unimodal loss $l(f_m)$: samples on which a modality incurs a larger loss should receive a smaller weight.
We employ this uncertainty assessment to weight the features of each modality, which can be expressed as:
\begin{equation*}{w_m} = {\alpha _m}{u_m} + {\beta _m}\tag{2}\end{equation*}
where $u_m$ is the confidence of modality $m$ derived from its energy score (introduced below), and $\alpha_m$ and $\beta_m$ are modality-specific scaling and offset terms.
In our implementation, $m$ ranges over two modalities, Chest X-Ray (CXR) images and time-series Electronic Health Records (EHR):
\begin{equation*}m \in \{ cxr,ehr\} \tag{3}\end{equation*}
The energy score [12] is used for reliable uncertainty estimation; it links the Helmholtz free energy of a data point to its corresponding probability density. To compute the energy score, the output features of each pre-trained unimodal encoder $f_m$, excluding its classifier, are first extracted. The CXR features are then projected to match the dimensionality of the EHR features using a projection layer $\phi$:
\begin{equation*}{h_{ehr}} = {f_{ehr}}\left( {{x_{ehr}}} \right),{h_{cxr}} = \phi \left( {{f_{cxr}}\left( {{x_{cxr}}} \right)} \right)\tag{4}\end{equation*}
The energy score $E$ for modality $m$ given the input $h_m$ is defined as:
\begin{equation*}E\left( {{h_m}} \right) = - {{\mathcal{T}}_m}\cdot \log \sum\limits_{k = 1}^{K} {{e^{h_m^{(k)}/{{\mathcal{T}}_m}}}} \tag{5}\end{equation*}
where $\mathcal{T}_m$ is the temperature parameter and $K$ is the number of components of $h_m$.
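For illustration, a minimal PyTorch sketch of the projection in Eq. (4) and the energy score in Eq. (5) is given below; the feature dimensions, temperature value, and variable names are assumptions for this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EnergyConfidence(nn.Module):
    """Project CXR features to the EHR dimension and compute per-sample energy scores (Eq. 4-5)."""

    def __init__(self, dim_cxr=512, dim_ehr=256, temperature=1.0):
        super().__init__()
        self.proj = nn.Linear(dim_cxr, dim_ehr)   # projection layer phi in Eq. (4)
        self.T = temperature                      # temperature T_m in Eq. (5)

    def energy(self, h):
        # E(h) = -T * log sum_k exp(h^(k) / T), computed over the K feature components
        return -self.T * torch.logsumexp(h / self.T, dim=-1)

    def forward(self, h_ehr, feat_cxr):
        h_cxr = self.proj(feat_cxr)                             # Eq. (4)
        e_ehr, e_cxr = self.energy(h_ehr), self.energy(h_cxr)   # Eq. (5), shape (batch,)
        return h_cxr, e_ehr, e_cxr

# usage with random features for a batch of 4 samples
mod = EnergyConfidence()
h_cxr, e_ehr, e_cxr = mod(torch.randn(4, 256), torch.randn(4, 512))
```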
To fulfill the condition outlined in relationship (1), a sampling-based regularization approach is applied, following the method in [13]. This approach uses the sample-wise loss during training as supervisory information. Prior to loss computation, a unimodal linear classifier $g_m$ is appended to each sample feature $h_m^{(i)}$ to obtain the unimodal prediction at training step $t$:
\begin{equation*}\hat y_m^{(i)} = g_m^{{\theta _t}}\left( {h_m^{(i)}} \right)\tag{6}\end{equation*}
\begin{equation*}l_m^{(i)} = \frac{1}{T}\sum\limits_{t = {T_s}}^{{T_s} + T} \ell \left( {{y^{(i)}},\hat y_m^{(i)}} \right)\tag{7}\end{equation*}
where $\ell$ is the task loss and the sample-wise loss $l_m^{(i)}$ is averaged over $T$ training steps starting from step $T_s$.
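The per-sample supervisory losses of Eqs. (6)-(7) can be tracked, for example, with a running record per sample; the buffer structure, window length, and the use of per-label BCE below are assumptions of this sketch rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
from collections import defaultdict, deque

class SampleLossTracker:
    """Keep a window of the last T per-sample unimodal losses (Eq. 7)."""

    def __init__(self, window=5):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, sample_ids, logits, targets):
        # per-sample loss for a multi-label task such as phenotyping (Eq. 6)
        losses = F.binary_cross_entropy_with_logits(
            logits, targets, reduction="none").mean(dim=1)
        for sid, l in zip(sample_ids, losses.detach().cpu().tolist()):
            self.history[sid].append(l)

    def averaged(self, sample_ids):
        # l_m^(i): average over the stored window of training steps (Eq. 7)
        return torch.tensor([sum(self.history[s]) / len(self.history[s])
                             for s in sample_ids])

# usage
tracker = SampleLossTracker(window=5)
logits, targets = torch.randn(4, 25), torch.randint(0, 2, (4, 25)).float()
tracker.update(["a", "b", "c", "d"], logits, targets)
print(tracker.averaged(["a", "b", "c", "d"]))
```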
During training, the dynamic weighted process is regularized by learning the following relationship:
\begin{equation*}l_m^{(i)} \geq l_m^{(j)} \Leftrightarrow w_m^{(i)} \leq w_m^{(j)}\tag{8}\end{equation*}
Our regularization term can be calculated with margin ranking loss as:
\begin{equation*}{{\mathcal{L}}_{reg}} = \max \left( {0,d\left( {w_m^{(i)},w_m^{(j)}} \right)\left( {l_m^{(i)} - l_m^{(j)}} \right) + \left| {w_m^{(i)} - w_m^{(j)}} \right|} \right)\tag{9}\end{equation*}
\begin{equation*} d\left( {w_m^{(i)},w_m^{(j)}} \right) = \begin{cases} 1 & \text{if } w_m^{(i)} > w_m^{(j)}, \\ 0 & \text{if } w_m^{(i)} = w_m^{(j)}, \\ -1 & \text{otherwise}. \end{cases}\tag{10}\end{equation*}
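A minimal sketch of the regularization term, implementing Eqs. (9)-(10) as written over all sample pairs within a batch for one modality; forming pairs exhaustively in the batch is an assumption of this sketch.

```python
import torch

def ranking_regularizer(w, l):
    """Margin-ranking regularization of Eqs. (9)-(10).
    w and l are 1-D tensors of per-sample weights and averaged losses."""
    w_i, w_j = w.unsqueeze(1), w.unsqueeze(0)   # pairwise grids over the batch
    l_i, l_j = l.unsqueeze(1), l.unsqueeze(0)
    d = torch.sign(w_i - w_j)                   # Eq. (10): +1 / 0 / -1
    reg = torch.clamp(d * (l_i - l_j) + (w_i - w_j).abs(), min=0)  # Eq. (9)
    return reg.mean()

# usage: samples with higher loss should receive lower weight
w = torch.tensor([0.9, 0.2, 0.5], requires_grad=True)
l = torch.tensor([0.1, 0.8, 0.4])
print(ranking_regularizer(w, l))
```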
As training progresses, the confidence of each modality dynamically adjusts the feature weights, producing the enhanced feature representations $h_m^{\ast}$:
\begin{equation*}h_m^{\ast} = {w_m}\cdot {h_m}\tag{11}\end{equation*}
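A small sketch combining Eqs. (2) and (11); treating $\alpha_m$ and $\beta_m$ as learnable scalars and deriving the confidence $u_m$ from the negative energy score via a sigmoid are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ConfidenceWeighting(nn.Module):
    """w_m = alpha_m * u_m + beta_m (Eq. 2) and h*_m = w_m * h_m (Eq. 11)."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, h, energy):
        u = torch.sigmoid(-energy)          # higher confidence <-> lower energy (assumption)
        w = self.alpha * u + self.beta      # Eq. (2), shape (batch,)
        return w.unsqueeze(-1) * h, w       # Eq. (11): scale each sample's features

# usage
cw = ConfidenceWeighting()
h_star, w = cw(torch.randn(4, 256), torch.randn(4))
```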
B. Expert Ensemble Fusion (EEF) Module
We introduce an Expert Ensemble Fusion (EEF) Module, inspired by the Vision Transformer (ViT) architecture [14], which serves as a fusion module to integrate the enhanced features $h_m^{\ast}$. The enhanced representation of each modality is treated as a sequence of $n_p$ token embeddings:
\begin{equation*}h_m^{\ast} = \left[ {e_1^m,e_2^m, \ldots ,e_{{n_p}}^m} \right]\tag{12}\end{equation*}
To obtain a global representation, a learnable classification embedding $e_{cls}$ is prepended to the concatenated token sequence, and positional embeddings $E_{pos}$ are added, yielding the input $Z_0$ to the first fusion block:
\begin{equation*}\begin{aligned} {Z_0} = {} & \left( {{e_{cls}} + {E_{pos}}[0,:]} \right) \oplus \left( {\left\{ {e_j^{ehr}} \right\}_{j = 1}^{{n_p}} + {E_{pos}}[1:,:]} \right) \\ & \oplus \left( {\left\{ {e_j^{cxr}} \right\}_{j = 1}^{{n_p}} + {E_{pos}}[1:,:]} \right) \end{aligned}\tag{13}\end{equation*}
In our Expert Ensemble Fusion (EEF) module, the single multi-layer perceptron (MLP) block of the original ViT is replaced with two distinct MLP-based Feedforward Networks (FFNs), $\text{FFN}_{cxr}$ and $\text{FFN}_{ehr}$. Inspired by [15], these FFNs act as separate experts that extract complementary information from the CXR and EHR modalities, respectively. Specifically, $\text{FFN}_{ehr}$ processes the initial $n_p + 1$ vectors (the classification token and the EHR tokens), while $\text{FFN}_{cxr}$ handles the subsequent $n_p$ CXR tokens. The outputs of the two experts are concatenated to form the output $Z_l$ of block $l$. A linear classifier $g_{fusion}$ then maps the output $Z_L$ of the final layer to the fused prediction $\hat{y}_{fusion}$.
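A condensed PyTorch sketch of one EEF block is shown below. The token routing follows the text ($\text{FFN}_{ehr}$ takes the first $n_p+1$ tokens, $\text{FFN}_{cxr}$ the remaining $n_p$), while the hidden sizes, activation, and pre-norm placement are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class EEFBlock(nn.Module):
    """One Transformer block with shared self-attention and modality-specific expert FFNs."""

    def __init__(self, dim=256, heads=4, n_p=8):
        super().__init__()
        self.n_p = n_p
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn_ehr = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_cxr = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        # shared multi-head self-attention over [cls | ehr tokens | cxr tokens]
        zn = self.norm1(z)
        a, _ = self.attn(zn, zn, zn)
        z = z + a
        zn = self.norm2(z)
        # route the first n_p + 1 tokens (cls + ehr) and the last n_p tokens (cxr) to separate experts
        out = torch.cat([self.ffn_ehr(zn[:, : self.n_p + 1]),
                         self.ffn_cxr(zn[:, self.n_p + 1:])], dim=1)
        return z + out

# usage: batch of 2, n_p = 8 tokens per modality plus one cls token
z0 = torch.randn(2, 1 + 8 + 8, 256)
print(EEFBlock()(z0).shape)   # torch.Size([2, 17, 256])
```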
Finally, the total loss is defined as follows to optimize the module:
\begin{equation*}{{\mathcal{L}}_{overall}} = {\mathcal{L}}\left( {y,{{\hat y}_{cxr}}} \right) + {\mathcal{L}}\left( {y,{{\hat y}_{ehr}}} \right) + {\mathcal{L}}\left( {y,{{\hat y}_{fusion}}} \right) + {{\mathcal{L}}_{reg}}\tag{14}\end{equation*}
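A minimal sketch of combining the three prediction losses and the regularization term of Eq. (14); the choice of BCE-with-logits as the task loss $\mathcal{L}$ is an assumption for a multi-label task such as phenotyping.

```python
import torch
import torch.nn.functional as F

def overall_loss(y, y_cxr, y_ehr, y_fusion, l_reg):
    """L_overall = L(y, y_cxr) + L(y, y_ehr) + L(y, y_fusion) + L_reg  (Eq. 14)."""
    bce = F.binary_cross_entropy_with_logits
    return bce(y_cxr, y) + bce(y_ehr, y) + bce(y_fusion, y) + l_reg

# usage with random logits for the 25-label phenotyping task
y = torch.randint(0, 2, (4, 25)).float()
logits = [torch.randn(4, 25) for _ in range(3)]
print(overall_loss(y, *logits, l_reg=torch.tensor(0.1)))
```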
Experiments
A. Datasets and Benchmark Tasks
The experimental data are drawn from MIMIC-IV [19] and MIMIC-CXR [20]. MIMIC-IV contains clinical time-series EHR data from ICU wards [21], and MIMIC-CXR contains chest X-ray images. To ensure a fair comparison and demonstrate the efficacy of our method, we follow the procedure outlined in [1] for data pre-processing and task definition. The tasks and their respective dataset details are outlined below:
Phenotype classification aims to predict whether each of 25 chronic, mixed, and acute care conditions is present during a patient's ICU stay (as detailed in Table X). The dataset comprises 47,069 cases, of which 10,804 have paired EHR and CXR data, while the remaining cases lack CXR data.
In-hospital mortality prediction aims to estimate the probability of a patient's death during the hospital stay, based on data from the first 48 hours after ICU admission. The dataset for this task includes 20,797 cases, of which 6,215 have paired data.
For both tasks, the EHR data comprise 17 clinical variables, while the CXR data consist of the last chest X-ray image collected during the ICU stay. All data were randomly split into training, validation, and test sets in a 70%/10%/20% ratio, with both paired and unpaired cases included in each split. We refer to the entire set of paired and unpaired data as the full dataset and to all paired data as the paired subset. The evaluation metrics for each task are the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC).
B. Implementation Details
Our method is implemented with the PyTorch framework and runs on an NVIDIA GeForce RTX 3090 Ti GPU (24 GB).
Following [1], the encoders are pre-trained as a two-layer stacked LSTM [22] for the clinical time-series data and a ResNet34 [23] for the chest X-ray images. We use 2 blocks in the Expert Ensemble Fusion (EEF) module (Fig. 1b), and the number of heads in the Multi-Head Self-Attention is set to 4. All training uses the Adam optimizer [24] with an initial learning rate of 0.00005 and a batch size of 16; a learning-rate scheduler reduces the learning rate when the validation AUROC does not improve for 15 epochs. The final test-set results are reported with 95% confidence intervals estimated through 1,000 bootstrap [25] iterations.
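The optimizer and scheduling setup described above could be reproduced roughly as follows; the specific scheduler class (ReduceLROnPlateau on validation AUROC) and the reduction factor are assumptions consistent with the text, and the model and validation metric are placeholders.

```python
import torch

model = torch.nn.Linear(256, 25)   # placeholder for the full fusion model

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
# reduce the learning rate when validation AUROC (mode="max") plateaus for 15 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=15)

for epoch in range(100):
    # ... training loop with batch size 16 ...
    val_auroc = 0.5   # placeholder: computed on the validation set each epoch
    scheduler.step(val_auroc)
```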
C. Comparison with the State-of-the-art Methods
As detailed in Table I, the proposed method is compared with several state-of-the-art approaches, including Early Fusion [16], Joint Fusion [1], MMTM [17], DAFT [18], and MedFuse [1], across various dataset settings. To highlight the superiority of the proposed fusion method, all experiments use the same pre-training strategy as MedFuse [1] to obtain the modality-specific encoders, given that the pre-trained models of MedFuse are not publicly available. Except for MMTM and DAFT, all methods were trained on the full dataset.
The proposed method demonstrates superior performance in all experiments for phenotyping and mortality prediction. Among the methods with pre-trained unimodal encoders, ours outperforms the other two models (i.e., Early Fusion [16] and MedFuse [1]) by a large margin on the paired subset for the mortality prediction task. When tested on the full dataset, which closely reflects real-world clinical conditions, our approach achieves the best performance across both tasks, with an AUROC of 0.762 and an AUPRC of 0.420 for phenotype classification, and an AUROC of 0.864 and an AUPRC of 0.519 for in-hospital mortality prediction. These results illustrate the effectiveness of incorporating sample-wise data confidence and a tailored fusion module, which significantly improves the accuracy of multimodal clinical prediction tasks.
D. Ablation Study
To assess the impact of our proposed Uncertainty-Aware dynamic Weighting strategy (UAW) and the Expert Ensemble Fusion (EEF) modules, we conducted a series of ablation experiments on the mortality prediction task with both the paired subset and full dataset. The results, presented in Table II, demonstrate that the UAW strategy and EEF module both enhance prediction performance across both test sets.
Conclusion
In this paper, a novel dynamic fusion framework is proposed for multimodal clinical prediction tasks involving time-series clinical data and chest X-ray images in the ICU context. The framework incorporates an uncertainty-aware dynamic weighting strategy that assigns greater sample-wise importance to modality features with higher confidence, and then feeds these enhanced features into a tailored fusion module to uncover intrinsic relationships between heterogeneous modalities. Experimental results demonstrate favorable performance across various data conditions.