Introduction
Federated Learning (FL) has emerged as a promising approach to harness the power of distributed data across multiple devices while ensuring user privacy [1], [2]. This decentralized paradigm has found notable applications in the Internet of Things (IoT), particularly in areas like Human Activity Recognition (HAR), where sensitive personal data is prevalent [3], [4], [5]. Complex applications like HAR typically require the integration of various data types, including images, inertial data, and audio, captured by multiple personal devices. Traditional FL approaches, however, have typically focused on data from a single, homogeneous data modality [6], [7], referred to as unimodal FL. The proliferation of smart IoT devices has led to the production of data from diverse modalities, necessitating a shift towards Multimodal FL (MFL), where participants possess more than one data modality in their local datasets. While MFL offers improved accuracy and enhanced robustness against missing data, its application in IoT networks presents significant challenges, particularly in managing data heterogeneity across devices and addressing variability in system performance.
MFL over heterogeneous IoT networks faces two core challenges: data heterogeneity and system heterogeneity. Data heterogeneity arises from variations in local datasets, both in statistical properties and in the types of data modalities available on IoT devices (i.e., modality heterogeneity). For instance, devices might record the same activity in vastly different ways, resulting in non-Independent and Identically Distributed (non-IID) data, or they may capture entirely different modalities, such as one device recording audio while another captures images. These disparities complicate the development of unified, accurate multi-modal models. In addition, system heterogeneity refers to differences in communication and computational capabilities between devices. Some devices, especially those handling more complex data types, may require longer processing times, leading to delays that compromise the overall efficiency of the training process. Addressing these challenges is essential for the scalability and efficiency of MFL over heterogeneous IoT systems.
MFL requires efficient utilization of available communication and computation resources distributed throughout the wireless network, while addressing the heterogeneity of multi-modal data residing on IoT devices. Traditional cellular networks, which rely on centralized base stations, struggle to meet the demands of MFL, particularly in heterogeneous environments where devices experience varying communication quality. This limitation is especially pronounced at the cell edges, where devices often suffer from higher delays due to weaker channel quality. A promising solution lies in the use of Cell-Free Massive MIMO (CF-mMIMO) systems, which distribute multiple access points (APs) throughout a network, replacing the single base station in traditional cellular networks. By providing uniform coverage and reducing communication bottlenecks, CF-mMIMO ensures consistent service quality even in scenarios with highly heterogeneous devices. This distributed network architecture is particularly well-suited to the needs of MFL, as it alleviates the system heterogeneity that can otherwise slow down collaborative model training.
In this paper, we address two fundamental research questions: (i) How can multi-modal data from heterogeneous IoT devices be effectively fused in MFL without a significant reduction in performance, particularly in the presence of missing modalities? (ii) How can resource allocation and device-modality selection be optimized in resource-limited CF-mMIMO networks to minimize the training delay and global loss of MFL? To answer these questions, we propose a late-fusion model architecture that allows for flexible client participation across varying data modalities, combined with a novel optimization scheme for resource allocation. Through extensive experiments, we demonstrate that our framework significantly reduces training time while maintaining model performance, even with missing modalities during inference.
Our work strategically addresses the challenges of data heterogeneity and system heterogeneity of MFL with the following contributions:
Late-fusion of Multi-modal Data: We employ a late-fusion model architecture that supports flexible client participation with any subset of data modalities in the MFL process. The proposed model architecture allows device-modality selection, making our scheme compatible with complex settings where participants possess dissimilar data modalities in the training phase, while maintaining robustness to missing modalities.
Performance Optimization of MFL: To improve the performance of MFL over CF-mMIMO networks, we formulate an optimization problem to jointly minimize the execution delay and global loss while taking into account communication and computation limitations. The latter objective is converted to a device-modality selection term based on the importance of multi-modal local datasets. Furthermore, we propose a greedy algorithm for prioritized device-modality selection that accounts for the quality of local updates and the communication channel. Then, we employ a modified Particle Swarm Optimization (PSO) algorithm to optimize the communication-computation resources within the CF-mMIMO network.
Experimental Validation: We show the effectiveness of the proposed framework with extensive experiments on the illustrative application of HAR. Simulation results show compatibility of the late-fusion model with different setups of MFL as well as robustness to missing modalities in the inference phase. In addition, our device-modality selection and resource allocation algorithms demonstrate improvements in terms of training delay and model performance.
The important notation of the paper is summarized in Table 1. The rest of this paper is organized as follows: Section II reviews existing works. Section III provides an overview of MFL over CF-mMIMO. In Section IV, we formulate the optimization problem for device-modality selection and resource allocation, and present our proposed solutions. Section V presents experiments evaluating our algorithms. Finally, Section VI concludes our work.
Related Works
In this section, we review recent advancements in MFL over resource-limited wireless networks, focusing on the integration of multi-modal data with the application of HAR.
A. Multi-Modal Federated Learning
A primary challenge in dealing with multi-modal data is learning and fusing representations for different data modalities [8]. Data fusion can generally be divided into fusion with raw modalities (early fusion) and fusion with abstract modalities (late fusion) [9], [10]. In the former, raw features from different modalities are combined, e.g., by concatenation. In late fusion, decisions made based on each modality are fused to make a final inference. Considering modality heterogeneity, MFL can be categorized into i) homogeneous (Homo-MFL), and ii) heterogeneous (Hetero-MFL). In Homo-MFL, users have similar data modalities with disjoint ID spaces across users. In Hetero-MFL, a more practical setting for HAR, users possess dissimilar combinations of data modalities and do not share the same ID spaces. Several works have studied the Homo-MFL process [11], [12], [13]. Novel techniques have been proposed in [14] and [15] for MFL with image and text data modalities. The algorithm proposed in [16] introduces inter-modal and intra-modal contrastive objectives to tackle the modality and task gap challenges. Authors in [17] consider the early fusion of data modalities in a feature-disentangled network with five modules for the HAR classification task. The extracted features are fed into separate classifiers: a private classifier for users with the same modalities, and a shared classifier used among all users. Zhao et al. [18] consider MFL over IoT data, where local end-to-end auto-encoders are trained for each modality based on FL. The multi-modal FedAvg is used to aggregate the local auto-encoders on unlabeled data. Then, using auxiliary data at the server, classifiers are trained for each user.
B. FL Over CF-mMIMO Networks
While some studies enhance CF-mMIMO networks using FL through private pre-coding [19], clustering in multi-cast routing [20], and associating users with APs [21], others focus on leveraging CF-mMIMO to enhance FL. The first study of FL over CF-mMIMO systems was conducted in [22], where an optimization problem was formulated to minimize the FL completion time by jointly optimizing the transmission power, data rate, and local processing frequency. Authors in [23] design a resource allocation algorithm for FL in CF-mMIMO systems aiming to minimize the execution time. To this end, they formulate an optimization problem that selects users with favorable channel conditions while jointly optimizing the transmit power and processing frequency. Authors in [24] propose an energy-efficient FL framework for cell-free IoT networks, minimizing energy consumption by optimizing the CPU frequency and power allocation to reduce the straggler effect. Article [25] employs the quantization error in analog-to-digital and digital-to-analog converters to support privacy in CF-mMIMO systems; in addition, it proposes an asynchronous scheme to mitigate the straggler effect in CF-mMIMO networks. Reference [26] formulates the problem of minimizing the FL loss function, which is then converted to a tractable closed-form problem by substituting it with the error rate function; it uses fractional programming to optimize the transmission power and minimize the packet error rate. Lastly, reference [27] proposes a practical implementation of over-the-air FL in CF-mMIMO networks, along with analytical and experimental studies supporting the benefit of CF-mMIMO networks for over-the-air learning.
The majority of existing works [24], [25], [27] primarily focus on minimizing communication costs, often overlooking critical factors such as the number of participants and the type of data modalities involved in the FL process. In this context, the user selection scheme proposed in [23] prioritizes participants based on their channel quality. Conversely, [28] represents the first study to address both user and modality selection, although it fails to consider the communication channel and the available communication resources.
Multi-Modal Federated Learning Over CF-mMIMO Networks
This section overviews MFL (Section III-A) and CF-mMIMO systems (Section III-B), discussing their challenges in connection with data and system heterogeneity.
A. Multi-Modal Federated Learning
MFL refers to the collaboration of several users who possess multiple data modalities to minimize a global function without sharing the raw data [29], [30]. Fig. 1 shows the general MFL system model for a set of $K$ users indexed by $k \in \mathbb{U} = \{1, \ldots, K\}$.
Snapshot of MFL over a CF-mMIMO system, where users possess different combinations of data modalities.
1) General MFL Framework
The proposed MFL framework includes several communication rounds, denoted by $r \in \mathbb{R} = \{1, \ldots, R\}$, where $R$ is the total number of rounds.
2) Distributed Multi-Modal Data
Let $\mathcal{M} = \{1, \ldots, M\}$ denote the set of data modalities available across the network. Each user $k$ holds a local multi-modal dataset $\mathcal{D}_{k}$ covering a subset of modalities $\mathcal{M}_{k} \subseteq \mathcal{M}$.
3) Deep Learning Model Design
We aim to design a structure where each user $k$ with any arbitrary combination of data modalities $\mathcal{M}_{k}$ can participate in training the global model.
Late-fusion network architecture of the global model, where each modality is associated with a sub-model. Users possess sub-models corresponding to their available data modalities.
4) Loss Function
The global objective in MFL is to minimize the weighted sum of loss values over the local datasets without direct access to the private data. In this regard, the following optimization problem is solved:\begin{equation*} \underset {\boldsymbol {W}({\mathcal {M}})}{\min } \; \; {\mathcal {L}}_{G}(\boldsymbol {W}({\mathcal {M}})) = \sum _{k \in \mathbb {U}}\pi _{k}{\mathcal {L}}_{k}({\mathcal {D}}_{k},\boldsymbol {W}({\mathcal {M}}_{k})), \tag {1}\end{equation*} where $\pi_{k}$ is the aggregation weight of user $k$, and the local loss of user $k$ is given as
\begin{align*} \underset {\boldsymbol {W}_{k}}{\min } \; \; {\mathcal {L}}_{k}({\mathcal {D}}_{k},\boldsymbol {W}({\mathcal {M}}_{k})) = \sum _{i = 1}^{|{\mathcal {D}}_{k}|} \ell \Big (F_{k}\big (\boldsymbol {x}_{ki},\boldsymbol {W}({\mathcal {M}}_{k})\big ),y_{ki}\Big ), \tag {2}\end{align*}
At each communication round r, the global weight parameters of the multi-modal model are updated at the central server as follows:\begin{align*} \boldsymbol {W}^{r}({\mathcal {M}}) = \boldsymbol {W}^{r-1}({\mathcal {M}}) + \sum _{k\in \mathbb {U}}\begin{bmatrix} \boldsymbol {1} \times p_{k,1} \\ \vdots \\ \boldsymbol {1} \times p_{k,M} \end{bmatrix} \odot \boldsymbol {G}^{r}_{k}({\mathcal {M}}), \tag {3}\end{align*}
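The aggregation in (3) can be viewed as a masked, modality-wise FedAvg: a user contributes to sub-model $m$ only if it trained modality $m$ in that round. A minimal NumPy sketch; the dictionary-based bookkeeping and names are illustrative, not taken from the paper:

```python
import numpy as np

def aggregate(global_w, local_updates, weights):
    """Modality-wise aggregation of local updates, cf. (3).

    global_w:      dict modality m -> parameter vector of sub-model m
    local_updates: dict user k -> dict modality m -> update vector G^r_{k,m};
                   a user only reports the modalities it actually trained
    weights:       dict user k -> dict modality m -> aggregation weight p_{k,m}
    """
    new_w = {m: w.copy() for m, w in global_w.items()}
    for k, updates in local_updates.items():
        for m, g in updates.items():
            # users lacking modality m contribute nothing to sub-model m,
            # which is exactly the masking effect of the indicator in (3)
            new_w[m] += weights[k][m] * g
    return new_w
```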
Considering the above-mentioned framework, we focus on the problem of MFL over resource-limited CF-mMIMO networks. In this context, we introduce the communication and computation models in the following subsection.
B. Overview of CF-mMIMO Systems
Let us assume that multiple APs, indexed by $a = 1, \ldots, A$, are distributed throughout the network and connected to a Central Unit (CU).
1) Intra-Zone Communication
Devices belonging to user $k$ are allowed to share raw data within the private zone of that user.
2) Inter-Zone Communication
Devices within a private zone communicate with other parties in the network to train a global model based on the FL algorithm. In this regard, we consider a user-centric setting where each user $k$ connects with a set of APs, denoted as $\mathcal{A}(k)$.
The device-AP association in CF-mMIMO networks can be applied based on various metrics, e.g., the distance of devices from the APs or the quality of the channel between devices and APs. We consider a Rayleigh fading channel model between APs and devices. Accordingly, denoting the channel between device $k$ and AP $a$ by $\boldsymbol{h}_{ka}$, the set of serving APs is given as \begin{equation*} {\mathcal {A}}(k) = \{a \; \; s.t. \; \; ||\boldsymbol {h}_{ka}||_{2} \geq h^{\text {thr}} \}, \tag {4}\end{equation*}
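To make the association rule concrete, the following sketch draws i.i.d. Rayleigh channels and forms $\mathcal{A}(k)$ by thresholding the channel norm as in (4); the dimensions and threshold value are assumed, and large-scale fading is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
K, A, N_ANT = 8, 4, 16     # users, APs, antennas per AP (illustrative values)
H_THR = 3.5                # association threshold h^thr (assumed value)

# i.i.d. Rayleigh fading: each entry is CN(0, 1); path loss/shadowing omitted
h = (rng.standard_normal((K, A, N_ANT))
     + 1j * rng.standard_normal((K, A, N_ANT))) / np.sqrt(2)

def serving_aps(k):
    """Set A(k) of APs whose channel norm to device k exceeds the threshold, cf. (4)."""
    return {a for a in range(A) if np.linalg.norm(h[k, a]) >= H_THR}

print(serving_aps(0))
```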
3) Delay Model
Data modalities may produce heterogeneous delays due to their specific characteristics, the employed network architecture, and the amount of computations required for local updates. The delay model of each user k at communication round r is given as\begin{equation*} t_{k}^{r} = d_{k}^{\text {ul},r} + d_{k}^{\text {comp},r} + d_{k}^{\text {dl},r}, k = 1,\ldots ,K, \tag {5}\end{equation*}
a: Uplink Transmission
We employ Non-Orthogonal Multiple Access (NOMA) in the uplink, as it allows participating devices to transmit within the same time-frequency blocks. Let $p_{k}^{r}$ denote the transmit power of user $k$ at communication round $r$.
The CU employs the Successive Interference Cancellation (SIC) algorithm to decode the received model updates. Accordingly, the uplink delay of user $k$ is given as \begin{equation*} d_{k}^{\text {ul}} = \frac {\sum _{m=1}^{M} g_{k,m}v^{r}_{k,m}}{r_{k}^{\text {ul}}}, \tag {6}\end{equation*} where $g_{k,m}$ is the size (in bits) of the update of sub-model $m$ at user $k$, $v^{r}_{k,m}$ is the binary device-modality selection variable, and $r_{k}^{\text {ul}}$ is the achievable uplink rate of user $k$.
b: Computations Delay
Denoting the CPU frequency of user $k$ at round $r$ by $f^{r}_{k}$, the number of CPU cycles required per sample of modality $m$ by $C_{k,m}$, and the number of local samples of modality $m$ by $\psi_{k,m}$, the computation delay is given as \begin{equation*} d_{k}^{\text {comp}} = \frac {\sum _{m=1}^{M} C_{k,m}\psi _{k,m}v^{r}_{k,m}}{f^{r}_{k}}, \tag {7}\end{equation*}
c: Downlink Transmission
In the downlink transmission, i.e., stage (S5), where APs send the updated models to the users, we employ an Orthogonal Multiple Access (OMA) scheme. Accordingly, sub-models associated with different data modalities are assigned orthogonal resources. In this regard, OMA ensures that the transmitted downlink messages are tailored to specific modalities, so that users with identical data modalities receive the same models. Consequently, as the number of selected users at each round (i.e., $K_{s}$) increases, the downlink delay does not grow, since at most $M$ modality-specific sub-models are broadcast.
d: Energy Consumption
The energy consumption at each user $k$ in round $r$ is given as \begin{equation*} e_{k}^{r} = e_{k}^{r,\text {comp}} + e_{k}^{r,\text {ul}}, \tag {8}\end{equation*} where the computation energy is
\begin{equation*} e_{k}^{r,\text {comp}} = \epsilon _{k}(f_{k}^{r})^{2} \sum _{m=1}^{M} C_{k,m}\psi _{k,m}v^{r}_{k,m}. \tag {9}\end{equation*}
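Once the selection vector is fixed, the per-round delay (5)-(7) and energy (8)-(9) reduce to a few dot products. A sketch under assumed units (bits, cycles, seconds); the uplink energy term of (8) is modeled here as transmit power times uplink time, which the text above does not spell out:

```python
import numpy as np

def user_round_cost(v, g, C, psi, r_ul, f, p, d_dl, eps):
    """Per-user round delay (5)-(7) and energy (8)-(9).

    v: binary selection vector over modalities (v[m] = v^r_{k,m})
    g: upload sizes in bits; C: CPU cycles per sample; psi: samples per modality
    r_ul: uplink rate (bit/s); f: CPU frequency (cycle/s); p: transmit power (W)
    d_dl: downlink delay (s); eps: effective capacitance coefficient epsilon_k
    """
    d_ul = np.dot(g, v) / r_ul            # uplink delay, cf. (6)
    cycles = np.dot(C * psi, v)           # total local CPU cycles
    d_comp = cycles / f                   # computation delay, cf. (7)
    t = d_ul + d_comp + d_dl              # round delay, cf. (5)
    e_comp = eps * f**2 * cycles          # computation energy, cf. (9)
    e_ul = p * d_ul                       # uplink energy: assumed p * d_ul (not shown above)
    return t, e_comp + e_ul               # total energy, cf. (8)

t, e = user_round_cost(v=np.array([1, 0, 1]), g=np.array([2e6, 5e5, 1e6]),
                       C=np.array([1e4, 5e3, 8e3]), psi=np.array([200, 200, 200]),
                       r_ul=5e6, f=1e9, p=0.2, d_dl=0.05, eps=1e-27)
```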
Device-Modality Selection and Resource Allocation for MFL in CF-mMIMO Systems
In this section, we first formulate the decision-making problem in MFL over resource-limited CF-mMIMO networks, corresponding to the (S1) stage of the MFL framework outlined in Section III-A.1. To address this problem and improve the performance of MFL, we subsequently propose a device-modality selection and resource allocation algorithm.
A. Problem Formulation
To train the multi-modal network structure in Fig. 2, we break it into $M$ sub-models. Accordingly, each device $k$ takes part in training the sub-models associated with its available data modalities $\mathcal{M}_{k}$.
In the proposed MFL framework, employing the information collected from the network, the server selects participating users, assigns data modalities for updates, and allocates communication resources for uplink transmission. In this regard, after completion of each round of MFL, the available sub-models at each user $k$ are fused to construct local multi-modal networks. In the MFL process, the $M$ sub-models are trained simultaneously considering the communication and computation limitations. The total training delay over $R$ communication rounds is expressed as \begin{equation*} T(\boldsymbol {S}, R, \boldsymbol {a}) = \sum _{r \in \mathbb {R}} T_{r}(\boldsymbol {s}_{r}, \boldsymbol {a}_{r}), \tag {10}\end{equation*} where $\boldsymbol{S} = [\boldsymbol{s}_{1},\ldots,\boldsymbol{s}_{R}]$ collects the per-round selection decisions, $\boldsymbol{a}$ the resource allocation decisions, and the per-round delay is
\begin{equation*} T_{r}(\boldsymbol {s}_{r},\boldsymbol {a}_{r}) = \underset {k}{\max } \; \{ \nu ^{r}_{k}t_{k}^{r} \}, \tag {11}\end{equation*}
The goal of MFL is to train a model that minimizes the global loss function \begin{equation*} {\mathcal {L}}(\boldsymbol {S},R,\boldsymbol {a}) = {\mathcal {L}}_{G}(\boldsymbol {W}^{(R)}({\mathcal {M}})). \tag {12}\end{equation*}
Considering the objective functions in (10) and (12), the following optimization problem is defined for performance improvement in MFL:\begin{align*} & \boldsymbol {{\mathcal {P}}}1) \; \; \underset {\boldsymbol {a}}{\min }\quad \{ T(\boldsymbol {S},R,\boldsymbol {a}), {\mathcal {L}}(\boldsymbol {S},R,\boldsymbol {a})\} \tag {13}\\ & \hspace {0.8cm}\textbf {s.t.} \; \; \mathbf {C1.} \ \ v_{k,m}^{r}\in \{0,1\}, \; \forall k \in \mathbb {U}, \; \forall m \in {\mathcal {M}}, \; \forall r\in \mathbb {R}, \tag {14}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }\mathbf {C2.} \ \ t^{r}_{k} \leq \tau _{\text {th}}, \; \forall k \in \mathbb {U}_{s}^{r}, \; \forall r \in \mathbb {R}, \tag {15}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }\mathbf {C3.} \ \ e_{k}^{r} \leq e_{k}^{\max }, \; \forall k \in \mathbb {U}_{s}^{r}, \; \forall r \in \mathbb {R}, \tag {16}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }\mathbf {C4.} \ \ 0 \leq p_{k}^{r} \leq p_{k}^{\max }, \; \forall k \in \mathbb {U}_{s}^{r}, \; \forall r \in \mathbb {R}, \tag {17}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }{\mathbf {C5.} \ \ f_{k}^{\min } \leq f_{k}^{r} \leq f_{k}^{\max }, \; \forall k \in \mathbb {U}_{s}^{r}, \; \forall r \in \mathbb {R}}, \tag {18}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }\mathbf {C6.} \ \ \sum _{\forall k \in \mathbb {U}}\nu ^{r}_{k} \geq N^{\text {QoL}}, \; \forall r \in \mathbb {R}, \tag {19}\end{align*}
B. Proposed Solution
Achieving an optimal solution for optimization problem $\boldsymbol{\mathcal{P}}1$ is intractable, owing to the binary device-modality selection variables and the lack of a closed-form expression for the global loss. We therefore decompose $\boldsymbol{\mathcal{P}}1$ into two sub-problems, device-modality selection and resource allocation, which are addressed in the following subsections.
1) Proposed Device-Modality Selection
Recent works [32], [33] have shown that the participation of a greater number of users can improve the convergence speed of the FL process. However, due to limited communication resources in CF-mMIMO networks, only a subset of devices can be selected in each communication round, since a large number of participants leads to significant delays (i.e., the straggler effect [34]). In addition, device-modality pairs contribute differently to training the multi-modal model in Fig. 2 due to system and data heterogeneity. Therefore, it is imperative to prioritize devices and modalities that make the most contribution to the learning process. One approach to achieve this is to consider a Single-class Queue (SingleQ), where all device-modality pairs are placed in a single queue and prioritized based on a unified metric, from which the global loss function can be estimated.
In our proposed selection scheme, device-modality pairs are modeled as a Multi-class Queue (MultiQ), where priority is given based on modality-specific metrics at each queue. Accordingly, let $L_{m}$ denote the estimated loss associated with sub-model $m$; the device-modality selection problem is then formulated as \begin{align*} & \hspace {-.5cm}\boldsymbol {{\mathcal {P}}}^{\text {DMS}}_{\text {MultiQ}}) \; \; \underset {\boldsymbol {v}^{r}}{\min }\quad L_{1},\ldots ,L_{M} \tag {20}\\ & \hspace {0.9cm}\textbf {s.t.} \; \; \mathbf {C1} \sim \mathbf {C6} \\ & \hphantom {\hspace {0.9cm}\textbf {s.t.} \; \; }\mathbf {C7)} \; -\sum _{k = 1}^{K} \zeta ^{r}_{k,m}v^{r}_{k,m} \geq N_{m}^{\text {QoL}}, \; \forall m \in {\mathcal {M}} \tag {21}\end{align*}
Algorithm 1 Proposed MultiQ Algorithm for Device-Modality Selection
Input: the set of all users $\mathbb{U}$, modality sets $\{\mathcal{M}_{k}\}$, and priority metrics $\gamma^{r}_{k,m}$
Construct one priority queue per modality, ordering device-modality pairs by $\gamma^{r}_{k,m}$
for each modality $m \in \mathcal{M}$ do
call Select-Check for the pair at the head of queue $m$
end for
while unvisited device-modality pairs remain do
call Select-Check for the highest-priority remaining pair across queues
end while
function Select-Check(k,m)
Update the tentative selection $v^{r}_{k,m} \leftarrow 1$
Compute the resulting round delay $t^{r}_{k}$ via (5)
Compute the resulting energy $e^{r}_{k}$ via (8)
if $t^{r}_{k} > \tau_{\text{th}}$ or $e^{r}_{k} > e^{\max}_{k}$ then revert $v^{r}_{k,m} \leftarrow 0$
end if
return $\boldsymbol{v}^{r}$
The selected device-modality vectors $\boldsymbol{v}^{r}$ are then passed to the resource allocation stage described next; a compact sketch of this greedy selection is given below.
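The sketch renders Algorithm 1's greedy loop in Python; the per-modality queue layout follows the MultiQ description, while `fits_budget` stands in for the delay/energy feasibility test of Select-Check and is assumed to be supplied by the caller:

```python
import heapq

def multiq_select(pairs, gamma, fits_budget):
    """Greedy MultiQ device-modality selection (sketch of Algorithm 1).

    pairs:       iterable of (k, m) device-modality pairs
    gamma:       dict (k, m) -> priority metric, e.g. NAbsG in (28)
    fits_budget: callable(selected, k, m) -> True if adding (k, m) keeps the
                 delay/energy constraints C2-C3 satisfied (assumed given)
    """
    # one max-priority queue per modality (heapq is a min-heap, so negate)
    queues = {}
    for (k, m) in pairs:
        queues.setdefault(m, []).append((-gamma[(k, m)], k))
    for q in queues.values():
        heapq.heapify(q)

    selected = set()
    # sweep the modality queues round-robin so that no modality starves
    while any(queues.values()):
        for m, q in queues.items():
            if not q:
                continue
            _, k = heapq.heappop(q)
            if fits_budget(selected, k, m):   # the Select-Check step
                selected.add((k, m))
    return selected
```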
2) Proposed Resource Allocation
Although the selected device-modality pairs are fixed during a communication round, the communication channel may change multiple times within that period. Hence, the resources need to be tuned accordingly. Given the device-modality selection from the first sub-problem, the resource allocation problem at communication round $r$ can be rewritten as follows:\begin{align*} & \hspace {-.5cm}\boldsymbol {{\mathcal {P}}}^{\text {RA}})\qquad \; \underset {\boldsymbol {p},\boldsymbol {f}}{\min }\quad T_{r}(\boldsymbol {s}_{r},\boldsymbol {a}_{r}) = \underset {k}{\max }\{ \nu _{k} t^{r}_{k}\} \tag {22}\\ & \hspace {0.9cm}\textbf {s.t.} \; \; \mathbf {C2} \sim \mathbf {C5}.\end{align*}
Compared to the initial problem in $\boldsymbol{\mathcal{P}}1$, the search space of $\boldsymbol{\mathcal{P}}^{\text{RA}}$ is reduced; however, the problem remains non-convex. We therefore employ a modified PSO algorithm, in which the position $\boldsymbol{\chi}^{(j)}$ and velocity $\boldsymbol{\delta}^{(j)}$ of each particle $j$ are updated as \begin{align*} \boldsymbol {\chi }^{(j)}(t+1) & = \boldsymbol {\chi }^{(j)}(t) + \boldsymbol {\delta }^{(j)}(t+1) \tag {23}\\ \boldsymbol {\delta }^{(j)}(t+1) & = \kappa \boldsymbol {\delta }^{(j)}(t) + r_{1}c_{1}\big (\boldsymbol {\chi }^{(j)}_{\text {best}}(t) - \boldsymbol {\chi }^{(j)}(t)\big ) \\ & \quad + r_{2}c_{2}\big (\boldsymbol {\chi }^{\text {swarm}}_{\text {best}}(t) - \boldsymbol {\chi }^{(j)}(t)\big ), \tag {24}\end{align*}
\begin{align*} \boldsymbol {\chi }^{(j)}_{\text {best}}(t)& = \underset {\boldsymbol {\chi } = \boldsymbol {\chi }^{(j)}(t'), \; t' = 0,\ldots ,t}{\text {arg min}} \; {\mathcal {C}}(\boldsymbol {\chi }), \tag {25}\\ \boldsymbol {\chi }^{\text {swarm}}_{\text {best}}(t)& = \underset {\boldsymbol {\chi } = \boldsymbol {\chi }^{(j)}_{\text {best}}(t), \; \forall j = 1,\ldots ,J }{\text {arg min}} \; {\mathcal {C}}(\boldsymbol {\chi }), \tag {26}\end{align*}
\begin{equation*} {\mathcal {C}}(\boldsymbol {\chi }) = T(\boldsymbol {s}_{r},\boldsymbol {a}_{r})+ \sum _{k\in \mathbb {U}_{s}} \Big ( I(t_{k}^{r}-\tau ^{\text {th}}) + I(e_{k}^{r}-e_{k}^{\max }) + I(p_{k}^{r}-p_{k}^{\max }) + I(-p_{k}^{r}) + I(f_{k}^{r}-f_{k}^{\max }) + I(f_{k}^{\min } - f_{k}^{r}) \Big ) \tag {27}\end{equation*} where $I(x)$ is a penalty function that equals zero for $x \leq 0$ and takes a large positive value otherwise, enforcing constraints $\mathbf{C2}\sim\mathbf{C5}$.
Algorithm 2 Proposed PSO Algorithm for Resource Allocation in MFL
Set velocities $\boldsymbol{\delta}^{(j)}(0) = \boldsymbol{0}$ for all particles $j = 1,\ldots,J$
Set starting positions $\boldsymbol{\chi}^{(j)}(0)$ to random points in the feasible region
for i = 1:itr do
for j = 1:J do
Update velocity using (24)
Update position using (23)
end for
Update the cost ${\mathcal{C}}(\boldsymbol{\chi}^{(j)})$ in (27) for each particle
Update local best positions using (25)
Update global best position using (26)
end for
The transmit power $p^{r}_{k}$ and CPU frequency $f^{r}_{k}$ returned by Algorithm 2 are then adopted by the selected users for the remainder of round $r$; a compact sketch of the PSO routine is given below.
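The sketch implements the particle updates (23)-(24) with the best-position rules (25)-(26); constraint handling is delegated to the penalized cost (27), which the caller passes in as `cost`. Hyper-parameter values are illustrative:

```python
import numpy as np

def pso(cost, lo, hi, n_particles=30, iters=100, kappa=0.7, c1=1.5, c2=1.5, seed=0):
    """Modified PSO for the resource allocation problem, cf. (23)-(27).

    cost: callable mapping a position chi (stacked power/frequency vector)
          to the penalized objective C(chi) in (27); lo/hi bound the search.
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    chi = rng.uniform(lo, hi, size=(n_particles, lo.size))   # starting positions
    delta = np.zeros_like(chi)                               # zero initial velocity
    p_best = chi.copy()                                      # per-particle best, cf. (25)
    p_cost = np.array([cost(x) for x in chi])
    g_best = p_best[p_cost.argmin()].copy()                  # swarm best, cf. (26)

    for _ in range(iters):
        r1 = rng.random((n_particles, 1))
        r2 = rng.random((n_particles, 1))
        delta = (kappa * delta + c1 * r1 * (p_best - chi)
                 + c2 * r2 * (g_best - chi))                 # velocity update, (24)
        chi = chi + delta                                    # position update, (23)
        c = np.array([cost(x) for x in chi])                 # penalized cost, (27)
        better = c < p_cost
        p_best[better], p_cost[better] = chi[better], c[better]
        g_best = p_best[p_cost.argmin()].copy()
    return g_best
```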
C. Device-Modality Priority Metric
The priority weight parameters $\gamma^{r}_{k,m}$, which determine the ordering of device-modality pairs in the proposed selection scheme, can be defined in several ways, as discussed below.
Probabilistic selection: One common approach is to select data modalities and devices according to a pre-determined distribution. However, this approach ignores the fact that different devices may have dissimilar data distributions for a specific modality and hence contribute differently to the training process. In addition, cross-modal heterogeneity makes particular modalities more important in the training process.
Gradient-based Selection: Gradients have been widely used in the literature for device selection in unimodal FL. However, comparing gradients across different modalities is challenging, as the associated sub-models might have different numbers of parameters. To this end, we introduce the following gradient-based metrics for device-modality selection in MFL:
1) Absolute of Gradients (AbsG)
The absolute value of the gradients measures the significance of the local updates of a user. In this case, the importance weights are formulated as $\gamma^{r}_{k,m} = |\boldsymbol{g}^{r}_{k,m}|$, where $\boldsymbol{g}^{r}_{k,m}$ denotes the gradient of sub-model $m$ at user $k$.
2) Normalized Absolute of Gradient (NAbsG)
To adapt AbsG to the multi-modal case, we define the Normalized Absolute of Gradient (NAbsG) as a metric to select device-modality pairs as follows:\begin{equation*} \gamma _{k,m}^{r} = \frac {|\boldsymbol {g}_{k,m}^{r}|}{n_{m}}, \tag {28}\end{equation*} where $n_{m}$ denotes the number of parameters of sub-model $m$.
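Both metrics are cheap to compute from the reported gradients. In the sketch below, $|\cdot|$ is interpreted as the sum of absolute gradient entries, an assumption the text does not pin down:

```python
import numpy as np

def priority_metrics(grads, n_params):
    """AbsG and NAbsG priority metrics for each device-modality pair.

    grads:    dict (k, m) -> gradient vector g^r_{k,m} of sub-model m at user k
    n_params: dict m -> number of parameters n_m of sub-model m
    """
    absg = {km: float(np.abs(g).sum()) for km, g in grads.items()}   # AbsG
    nabsg = {km: absg[km] / n_params[km[1]] for km in grads}         # NAbsG, cf. (28)
    return absg, nabsg
```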
Simulation Results
In this section, we evaluate the performance of the proposed MFL framework as described in subsection III-A.1. Using HAR as an illustrative application, we first outline the setup of the multi-modal dataset, its distribution across clients, and the communication system configuration shown in Fig. 1. We then introduce performance metrics to analyze the framework from the perspectives of multi-modal machine learning and communications. Finally, we present the evaluation results and discussions, demonstrating the improvements in model performance and training latency achieved with the proposed late-fusion model and decision-making framework for MFL. Specifically, under full client participation, we validate the benefits of the late-fusion model for MFL and demonstrate its robustness to missing modalities during inference, outperforming alternative fusion methods. Furthermore, we show the effectiveness of the proposed device-modality selection in enhancing joint accuracy of sub-models while reducing latency when only a subset of device-modality pairs are selected. Finally, we highlight the effectiveness of CF-mMIMO networks for MFL and evaluate the proposed resource allocation scheme in further reducing latency within resource-limited systems.
A. Simulation Setup
1) Dataset Setup
We used the HARWE dataset [35], which consists of
2) Communication Network
We consider a CF-mMIMO network (see Fig. 1), where
B. Evaluation Metrics
To assess the performance of the proposed MFL framework, we consider the following performance indices.
1) Classification Metrics
The classification test accuracy of the global model, denoted by $A$, is defined as the ratio of correctly labeled data samples to the total number of samples in the test set. To jointly capture the average accuracy and the balance across the modality-specific sub-models, we define the joint accuracy index as \begin{equation*} I_{r}^{\text {ML}} = \bar {A}^{r} \exp \left ({{ -\frac {1}{M}\sum _{m=1}^{M}{(A_{m}^{r} - \bar {A}^{r})^{2}} }}\right ), \tag {29}\end{equation*} where $A_{m}^{r}$ is the test accuracy of sub-model $m$ at round $r$ and $\bar{A}^{r}$ is the average accuracy over the $M$ modalities.
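Equation (29) rewards a high average accuracy while penalizing spread across the modality-specific sub-models; a direct transcription:

```python
import numpy as np

def joint_accuracy_index(acc):
    """Joint accuracy index I_r^ML of (29) from per-modality accuracies A_m^r."""
    acc = np.asarray(acc, dtype=float)
    mean = acc.mean()
    # high average accuracy AND low spread across modalities give a high index
    return mean * np.exp(-np.mean((acc - mean) ** 2))

print(joint_accuracy_index([0.92, 0.90, 0.91]))  # balanced sub-models -> ~0.91
print(joint_accuracy_index([0.99, 0.60, 0.91]))  # imbalanced -> noticeably lower
```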
2) Latency
The latency of each communication round is defined in (10). Additionally, we define the average delay desynchronization for round r as follows:\begin{equation*} I_{r}^{\text {desync}} = \frac {1}{K_{s}} \sum _{k\in \mathbb {U}_{s}^{r}} (T_{r}(\boldsymbol {s}_{r},\boldsymbol {a}_{r}) - d_{k}^{r}), \tag {30}\end{equation*} where $d_{k}^{r}$ denotes the delay of user $k$ in round $r$.
C. Evaluation Results
We present our experimental results in three parts to evaluate the performance of i) the proposed late-fusion structure, compared with early-fusion [17] and intermediate-fusion [18] alternatives, ii) the proposed device-modality selection scheme in comparison with existing approaches from the literature [28], [36], [37], [38], [39], and iii) MFL over CF-mMIMO networks, particularly with the proposed PSO power allocation technique.
1) Multi-Modal Data Fusion in FL
Each multi-modal user needs to construct a unified local model that matches its local dataset based on the available data modalities. The updated models are uploaded to the CU for aggregation in each communication round. We evaluate the performance of our proposed late-fusion (LF) multi-modal structure (Fig. 2) against baseline models in terms of test accuracy within the MFL framework. Additionally, we analyze its ability to address the challenge of missing modalities during the inference phase compared to alternative approaches. Finally, the computational requirements of the proposed model are compared with those of the baselines. Accordingly, we consider two types of fusion models widely used in the literature, namely, early fusion (EF) [17] and intermediate fusion (IF) [18]. In the EF model structure, raw features of all data modalities are combined before being fed into the neural network. In contrast, IF combines the extracted features of all modalities and uses them as the input of a joint classifier. Although the proposed LF structure is compatible with scenarios with ID-space and feature-space heterogeneity, the EF and IF models require a) each local dataset to have aligned multi-modal samples, i.e., Homo-MFL, and b) all local datasets to have the same data modalities. Accordingly, to allow a comparison between LF and the two alternatives, we consider Homo-MFL in this part of the experiments. In addition, for the sake of a fair comparison, we consider a similar neural network structure for all three fusion models. We compare the performance of the three fusion models during both training and inference within the context of MFL.
Fig. 3 illustrates test accuracy versus communication round for the three fusion types when all users are selected in each communication round. As can be seen, both the early-fusion and intermediate-fusion structures slightly outperform the late-fusion model in terms of accuracy. In particular, early and intermediate fusion capture the interconnected structure of the data modalities, resulting in better performance. Table 2 compares the number of Floating-Point Operations (FLOPs) and the upload size of each model. The proposed late-fusion structure has the lowest FLOPs, demanding the least computation. In addition, intermediate fusion provides the smallest model size compared to the other two baselines, resulting in lower latency on average. However, the main advantage of the late-fusion structure is the degree of freedom in selecting device-modality pairs, which allows (a) flexible participation of modality-specific sub-models, and (b) robustness to missing modalities. This important property makes the late-fusion model an ideal structure for Hetero-MFL scenarios with ID-space and feature-space data heterogeneity.
To evaluate robustness to missing data, we consider the scenario where one or two of the data modalities are missing in the inference phase. In this regard, for the early-fusion and intermediate-fusion baselines, reconstruction networks are trained locally to recover the missing data. However, the proposed late-fusion structure does not require a reconstruction network, as each sub-model in the network structure can make independent decisions about the activity class labels. The bar charts in Fig. 4 (a) and (b) compare the test accuracy of the three models where one and two data modalities are missing, respectively. As can be seen, the late-fusion structure is more robust to missing data, while the accuracy of the early-fusion and intermediate-fusion models drops with missing data. In fact, averaging the decisions of the available sub-models allows the late-fusion model to produce reliable predictions even when some modalities are absent.
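The decision-level averaging that underpins this robustness can be sketched in a few lines: the fused prediction simply averages the class scores of whichever sub-models have their modality available (function and variable names are illustrative):

```python
import numpy as np

def late_fusion_predict(scores_by_modality, available):
    """Decision-level fusion: average the class scores of the sub-models whose
    modality is observed at inference time; no reconstruction network needed.

    scores_by_modality: dict modality -> class-score vector from its sub-model
    available:          set of modalities present in the current sample
    """
    present = [scores_by_modality[m] for m in available if m in scores_by_modality]
    fused = np.mean(present, axis=0)   # assumes at least one modality is present
    return int(np.argmax(fused))
```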
Comparison of different fusion models with a) one, and b) two missing modalities in the inference phase. W: watch modality, P: phone modality, and S: speaker modality.
2) Multi-Modal Device Selection
At the start of each communication round, i.e., Stage (S1), the CU decides which device-modality pairs are selected for participation in the current round based on Algorithm 1. Considering the limitations of the communication network, a desirable selection scheme picks the device-modality pairs that make the most contribution to the MFL process in each communication round while producing less delay. Accordingly, a well-trained HAR model maximizes the average accuracy of the uni-modal sub-models and minimizes the gap between sub-models corresponding to different data modalities, resulting in an increased joint accuracy as formulated in (29). In light of this, we compare our proposed MultiQ device-modality selection scheme with five baselines: i) Full Selection (FS), where all devices and modalities participate in the MFL process [24], [36], [37], [38], ii) Client Selection (CS), in which the whole multi-modal dataset of a selected client (i.e., user) is employed to update the local model in each round [23], [25], [39], iii) the Modality-Aware (MA) scheme [28], where a combination of the size and effect of sub-models is employed as the selection metric, iv) random selection, where devices and modalities are selected randomly in each round, and v) SingleQ, where device-modality pairs are prioritized based on the AbsG and NAbsG metrics. Note that, except in the FS scheme, only half of the device-modality pairs are selected at each communication round.
Fig. 5 (a) and (b) compare the aforementioned selection schemes in terms of accuracy and latency over a resource-limited wireless network. As can be seen, the proposed MultiQ-AbsG method provides only slightly lower accuracy compared to the FS scheme, despite selecting only half of the possible device-modality pairs. It achieves higher joint accuracy compared to the CS and MA schemes, as our proposed MultiQ-AbsG prioritizes device-modality pairs with the greatest impact on the model. In contrast, the CS scheme selects all modalities of a chosen client, often including device-modalities with minimal contribution. Similarly, the MA scheme tends to favor smaller models, which typically have less impact on the overall fusion model, resulting in an increased performance gap among different sub-models. The SingleQ-AbsG scheme produces the lowest test accuracy, since modalities with larger sub-models possess larger gradients and hence have a higher chance of being selected over those with smaller sub-models. The performance of this selection scheme is enhanced by the NAbsG metric, where the gradients are normalized by sub-model size. Regarding latency, the FS scheme incurs the largest delay due to the high number of participants, while the CS and random selection methods experience less delay since a uniform portion of modalities is selected. Furthermore, the SingleQ-AbsG and SingleQ-NAbsG methods exhibit the least delay since they inherently prioritize uploading the smallest gradients to the server. The MA algorithm achieves slightly lower delay than our proposed MultiQ-AbsG scheme, as it tends to select smaller sub-models for transmission.
Similar to the unimodal case, statistical data heterogeneity can significantly affect the performance of the trained HAR model by resulting in biased models. This challenge becomes particularly pronounced when local datasets exhibit an imbalanced distribution, which can degrade the overall model performance. In this regard, we consider three major cases of data heterogeneity to illustrate the effect of device-modality selection over model performance: i) non-IID heterogeneity, where users possess non-IID data with the same number of samples, e.g., perform activities at different paces, ii) label heterogeneity, where users possess different activity classes (LabelHet), and iii) sample heterogeneity (SampleHet), where users have uneven numbers of samples. For the sake of fair comparison, we use an equal number of total data samples in the network for all the aforementioned scenarios.
To this end, Fig. 6 shows the accuracy of MFL with the late-fusion structure under full participation of users (i.e., $K_{s} = K$) for the three heterogeneity scenarios.
Comparison of test accuracy versus communication round for different numbers of selected device-modality pairs.
3) CF-mMIMO Resource Allocation
Overall, system heterogeneity is the main source of communication desynchronization and hence of significant delay in MFL. In particular, latency is affected by three main variables: the number of selected clients, the selected device-modality pairs, and the availability and optimal allocation of communication-computation resources for the selected device-modality set. We conduct experiments to compare the performance of the CF-mMIMO network and the proposed resource allocation scheme with conventional MIMO networks in MFL. Replacing the single base station of conventional MIMO systems with multiple APs in CF-mMIMO networks helps reduce the disparity between the channel conditions of devices across the network and potentially reduces latency in MFL. Fig. 7 illustrates the average communication channel gain across the network for different numbers of APs, where the total number of transmitting antennas across the network is kept equal for a fair comparison.
Comparison of channel quality between conventional MIMO and CF-mMIMO networks: a) centralized MIMO, b) CF-mMIMO with four APs, and c) boxplot of the mean and variance of the average channel across the network versus the number of APs. For a fair comparison, the total number of transmitting antennas in the network is kept the same across all configurations.
At Stage (S1) of each communication round, our proposed PSO scheme optimizes the communication and computation resources in problem $\boldsymbol{\mathcal{P}}^{\text{RA}}$.
Comparison of (a) communication desynchronization and (b) latency versus the number of APs between the proposed PSO-based resource allocation algorithm and a baseline with maximum power transmission. In both figures, the total number of antennas in the network is fixed.
To further validate the advantages of CF-mMIMO and our resource allocation algorithm for MFL, we examine the number of participants involved in the MFL process under a fixed communication round deadline $\tau_{\text{th}}$.
Comparison of the average number of participants in the MFL framework versus the number of APs for the proposed PSO algorithm and the baseline without resource allocation.
Conclusion
Our proposed framework integrates a late-fusion strategy with a device-modality selection method and a resource allocation scheme based on a modified PSO algorithm, effectively addressing the data and system heterogeneity challenges of MFL over CF-mMIMO networks. By employing late fusion, our approach ensures flexible and robust model performance even with missing modalities, which is crucial for HAR applications. The experimental results demonstrate that our fusion model outperforms baseline fusion methods, achieving 15% and 23% higher test accuracy when one and two data modalities are missing in the inference phase, respectively. Additionally, the proposed device-modality selection and resource allocation scheme effectively minimizes the disparity between modality-specific sub-models and reduces communication delays in each round. Our results demonstrate the superiority of CF-mMIMO networks over conventional systems in addressing system heterogeneity, achieving reduced completion times for the MFL process. Future work will explore techniques to mitigate modality-specific data heterogeneity for the late-fusion model, and further optimize resource allocation to improve scalability and efficiency in diverse application scenarios.