
Multi-Modal Federated Learning Over Cell-Free Massive MIMO Systems for Activity Recognition




Abstract:

This paper addresses the problem of Multi-modal Federated Learning (MFL) over resource-limited Cell-Free massive MIMO (CF-mMIMO) networks for the application of Human Activity Recognition (HAR). MFL leverages diverse data modalities across various clients, while the CF-mMIMO network ensures consistent service quality, crucial for collaborative training. The primary challenges of MFL are data heterogeneity, which includes statistical and modality heterogeneity that complicate data fusion, client collaboration, and inference with missing data, and system heterogeneity, where devices with dissimilar modalities experience varied processing and communication delays, increasing overall training latency. To tackle these issues, we propose a late-fusion model architecture that allows flexible client participation with any combination of data modalities, and formulate an optimization problem to jointly minimize latency and global loss in MFL. We propose a prioritized device-modality selection scheme that allows flexible participation of devices. Additionally, we employ a modified Particle Swarm Optimization (PSO) algorithm for efficient resource allocation. Extensive experiments validate our framework, demonstrating substantial reductions in training time and significant improvements in model performance, particularly an average improvement of 15% and 23% in test accuracy over the other fusion models when one and two modalities are missing in the inference phase, respectively.
Published in: IEEE Access ( Volume: 13)
Page(s): 40844 - 40858
Date of Publication: 04 March 2025
Electronic ISSN: 2169-3536

License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
SECTION I.

Introduction

Federated Learning (FL) has emerged as a promising approach to harness the power of distributed data across multiple devices while ensuring user privacy [1], [2]. This decentralized paradigm has found notable applications in the Internet of Things (IoT), particularly in areas like Human Activity Recognition (HAR), where sensitive personal data is prevalent [3], [4], [5]. Complex applications like HAR typically require the integration of various data types, including images, inertial data, and audio, captured by multiple personal devices. Traditional FL approaches, however, have typically focused on data from a single, homogeneous data modality [6], [7], referred to as unimodal FL. Meanwhile, the proliferation of smart IoT devices has led to the production of data from diverse modalities, necessitating a shift towards Multimodal FL (MFL), where participants possess more than one data modality in their local datasets. While MFL offers improved accuracy and enhanced robustness against missing data, its application in IoT networks presents significant challenges, particularly in managing data heterogeneity across devices and addressing the variability in system performance.

MFL over heterogeneous IoT networks faces two core challenges: data heterogeneity and system heterogeneity. Data heterogeneity arises from variations in local datasets, both in statistical properties and in the types of data modalities available on IoT devices (i.e., modality heterogeneity). For instance, devices might record the same activity in vastly different ways, resulting in non-Independent and Identically Distributed (non-IID) data, or they may capture entirely different modalities, such as one device recording audio while another captures images. These disparities complicate the development of unified, accurate multi-modal models. In addition, system heterogeneity refers to the differences in communication and computational capabilities between devices. Some devices, especially those handling more complex data types, may require longer processing times, leading to delays that compromise the overall efficiency of the training process. Addressing these challenges is essential for the scalability and efficiency of MFL over heterogeneous IoT systems.

MFL requires efficient utilization of available communication and computation resources distributed throughout the wireless network, while addressing the heterogeneity of multi-modal data residing on IoT devices. Traditional cellular networks, which rely on centralized base stations, struggle to meet the demands of MFL, particularly in heterogeneous environments where devices experience varying communication quality. This limitation is especially pronounced at the cell edges, where devices often suffer from higher delays due to weaker channel quality. A promising solution lies in the use of Cell-Free Massive MIMO (CF-mMIMO) systems, which distribute multiple access points (APs) throughout a network, replacing the single base station in traditional cellular networks. By providing uniform coverage and reducing communication bottlenecks, CF-mMIMO ensures consistent service quality even in scenarios with highly heterogeneous devices. This distributed network architecture is particularly well-suited to the needs of MFL, as it alleviates the system heterogeneity that can otherwise slow down collaborative model training.

In this paper, we address two fundamental research questions: (1) How can multi-modal data from heterogeneous IoT devices be effectively fused in MFL without a significant reduction in performance, particularly in the presence of missing modalities? (2) How can resource allocation and device-modality selection be optimized in resource-limited CF-mMIMO networks to minimize training delays and the global loss of MFL? To answer these questions, we propose a late-fusion model architecture that allows for flexible client participation across varying data modalities, combined with a novel optimization scheme for resource allocation. Through extensive experiments, we demonstrate that our framework significantly reduces training time while maintaining model performance, even with missing modalities during inference.

Our work strategically addresses the challenges of data heterogeneity and system heterogeneity of MFL with the following contributions:

  • Late-fusion of Multi-modal Data: We employ a late-fusion model architecture that supports flexible client participation with any subset of data modalities in the MFL process. The proposed model architecture allows device-modality selection, making our scheme compatible with complex settings where participants possess dissimilar data modalities in the training phase, while maintaining robustness to missing modalities.

  • Performance Optimization of MFL: To improve the performance of MFL over CF-mMIMO networks, we formulate an optimization problem to jointly minimize the execution delay and global loss while taking into account communication and computation limitations. The latter objective is converted into a device-modality selection term based on the importance of the multi-modal local datasets. Furthermore, we propose a greedy algorithm for prioritized device-modality selection that takes into account the quality of local updates and the communication channel. Then, we employ a modified Particle Swarm Optimization (PSO) algorithm to optimize the communication-computation resources within the CF-mMIMO network.

  • Experimental Validation: We show the effectiveness of the proposed framework with extensive experiments on the illustrative application of HAR. Simulation results show compatibility of the late-fusion model with different setups of MFL as well as robustness to missing modalities in the inference phase. In addition, our device-modality selection and resource allocation algorithms demonstrate improvements in terms of training delay and model performance.

The important notations of the paper are summarized in Table 1. The rest of this paper is organized as follows: Section II reviews the existing works. Section III provides an overview of MFL over CF-mMIMO. In Section IV, we formulate the optimization problem for the device-modality selection scheme and resource allocation, and present our proposed solutions. Section V presents the experiments used to evaluate our algorithms. Finally, Section VI concludes the paper.

TABLE 1. Important Notations of the Paper

SECTION II.

Related Works

In this section, we review recent advancements in MFL over resource-limited wireless networks, focusing on the integration of multi-modal data for the application of HAR.

A. Multi-Modal Federated Learning

A primary challenge in dealing with multi-modal data is learning and fusing representations of different data modalities [8]. Data fusion can generally be divided into fusion with raw modalities (early fusion) and fusion with abstract modalities (late fusion) [9], [10]. In the former, raw features from different modalities are combined, e.g., by concatenation. In late fusion, decisions made based on each modality are fused to make a final inference. Considering modality heterogeneity, MFL can be categorized into i) homogeneous (Homo-MFL) and ii) heterogeneous (Hetero-MFL). In Homo-MFL, users have similar data modalities with disjoint ID spaces across users. In Hetero-MFL, a more practical setting for HAR, users possess dissimilar combinations of data modalities and do not share the same ID spaces. Several works have studied the Homo-MFL process [11], [12], [13]. Novel techniques have been proposed in [14] and [15] for MFL with image and text data modalities. The algorithm proposed in [16] introduces inter-modal and intra-modal contrastive objectives to tackle the modality and task gap challenges. The authors in [17] consider the early fusion of data modalities in a feature-disentangled network with five modules for the HAR classification task. The extracted features are fed into separate classifiers: a private classifier for users with the same modalities, and a shared classifier used among all users. Zhao et al. [18] consider MFL over IoT data, where local end-to-end auto-encoders are trained for each modality based on FL. A multi-modal FedAvg is used to aggregate the local auto-encoders trained on unlabeled data. Then, using auxiliary data at the server, classifiers are trained for each user.

B. FL Over CF-mMIMO Networks

While some studies enhance CF-mMIMO networks using FL through private pre-coding [19], clustering in multi-cast routing [20], and associating users with APs [21], others focus on leveraging CF-mMIMO to enhance FL. The first study of FL over CF-mMIMO systems was conducted in [22], where an optimization problem was formulated to minimize the FL completion time by jointly optimizing the transmission power, data rate, and local processing frequency. The authors in [23] design a resource allocation algorithm for FL in CF-mMIMO systems aiming to minimize the execution time; to this end, they formulate an optimization problem that selects users with favorable channel conditions while jointly optimizing the transmit power and processing frequency. The authors in [24] proposed an energy-efficient FL framework for cell-free IoT networks that minimizes energy consumption by optimizing CPU frequency and power allocation to reduce the straggler effect. Article [25] exploited the quantization error in analog-to-digital and digital-to-analog converters to support privacy in CF-mMIMO systems; in addition, it proposed an asynchronous scheme to mitigate the straggler effect in CF-mMIMO networks. Reference [26] formulates the problem of minimizing the FL loss function, which is then converted into a tractable closed-form problem by substituting the loss with a packet error rate function; it uses fractional programming to optimize the transmission power to minimize the packet error rate. Lastly, reference [27] proposes a practical implementation of over-the-air FL in CF-mMIMO networks, along with analytical and experimental studies supporting the benefit of CF-mMIMO networks for over-the-air learning.

The majority of existing works [24], [25], [27] primarily focus on minimizing communication costs, often overlooking critical factors such as the number of participants and the type of data modalities involved in the FL process. In this context, the user selection scheme proposed in [23] prioritizes participants based on their channel quality. Conversely, [28] represents the first study to address both user and modality selection, although it fails to consider the communication channel and the available communication resources.

SECTION III.

Multi-Modal Federated Learning Over CF-mMIMO Networks

This section overviews MFL (Section III-A) and CF-mMIMO systems (Section III-B), discussing their challenges in connection with data and system heterogeneity.

A. Multi-Modal Federated Learning

MFL refers to the collaboration of several users who possess multiple data modalities to minimize a global function without sharing the raw data [29], [30]. Fig. 1 shows the general MFL system model for a set of K users indexed by \mathbb {U} = \{1,\ldots ,K \} . We consider a general setting which is compatible with both Homo-MFL and Hetero-MFL.

FIGURE 1. Snapshot of MFL over a CF-mMIMO system, where users possess different combinations of data modalities.

1) General MFL Framework

The proposed MFL framework includes several communication rounds, denoted by \mathbb {R} = \{1,\ldots ,R\} , where in each round r, sub-models are exchanged between the server and users in five main stages, namely (S1) decision-making, (S2) local update, (S3) model upload, (S4) model aggregation, and (S5) model synchronization. In stage (S1), the server gathers information for the current communication round from the users, including their channel conditions, computation power, and available data modalities. Then, the server selects a subset of users \mathbb {U}_{s}^{r} \subset \mathbb {U} , each associated with one or multiple specific data modalities, to participate in communication round r of the FL process. In stage (S2), the selected users update the local models based on their local datasets. Next, in (S3), the updated models are sent to the central server using the uplink transmission model. The central server performs aggregation of the received local updates based on, e.g., FedAvg in (S4). Finally, the aggregated model is sent back to the users using the downlink transmission model in (S5).

2) Distributed Multi-Modal Data

Let {\mathcal {M}} = \{1,\ldots ,M\} denote the set of all global data modalities, and {\mathcal {M}}_{k} \subseteq {\mathcal {M}} denote the set of modalities at user k which has cardinality M_{k} = |{\mathcal {M}}_{k}| . Note that in general, we assume {\mathcal {M}}_{k}\neq {\mathcal {M}}_{j}, k\neq j . We denote the local dataset of user k as {\mathcal {D}}_{k} = \{{\mathcal {X}}_{ki},{\mathcal {Y}}_{ki}\}_{i=1}^{|{\mathcal {D}}_{k}|} , where {\mathcal {X}}_{ki}=[{\mathcal {X}}^{(m)}_{ki}]_{m \in {\mathcal {M}}_{k}} and {\mathcal {Y}}_{ki} denote the feature set and the corresponding label of i^{\text {th}} data sample at user k.

3) Deep Learning Model Design

We aim to design a structure where each user k with any arbitrary combination of data modalities {\mathcal {M}}_{k} \subseteq {\mathcal {M}} can participate in training, and employ all its available data modalities to perform a classification task. Accordingly, we propose a late fusion structure (Fig. 2), where the global model is broken into |{\mathcal {M}}| sub-models each corresponding to a data modality. Each sub-model m consists of an encoder and a classifier that map the input data of modality m to the activity class probabilities. The outputs of all sub-models are combined using a late fusion function, denoted by \Psi (\cdot) , for the final inference. Similarly, the neural network at each user k consists of M_{k} = |{\mathcal {M}}_{k}| sub-models, each associated with a data modality m\in {\mathcal {M}}_{k} . Consequently, users with overlapping data modalities ({\mathcal {M}}_{k} \cap {\mathcal {M}}_{j} \neq \varnothing ) can employ their private data samples to contribute to the global model training. Let f_{m}(\cdot) denote the sub-model corresponding to modality m and \boldsymbol {\omega }_{k}^{(m)} represent the local weights parameters at user k for modality m. Accordingly, the i^{\text {th}} local data sample from modality m\in {\mathcal {M}}_{k} at user k, represented by {\mathcal {X}}_{ki}^{(m)} , is fed into its corresponding sub-model to obtain the modality-specific logits as \boldsymbol {o}_{ki}^{(m)} \; = f_{m}({\mathcal {X}}^{(m)}_{ki},\boldsymbol {\omega }_{k}^{(m)}) . Then, the predicted label of the i^{\text {th}} sample at user k can be obtained as \tilde {{\mathcal {Y}}}_{ki} = \Psi (\boldsymbol {O}_{ki}) , where \boldsymbol {O}_{ki} = [\boldsymbol {o}^{(m)}_{ki}]_{m\in {\mathcal {M}}_{k}} is a matrix containing logits from the M_{k} data modalities.
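For illustration, the following minimal Python sketch implements the late-fusion inference described above, with placeholder linear sub-models standing in for the encoder-classifier pairs and an equal-weight averaging for \Psi (\cdot) ; the sub-model internals and fusion weights are illustrative assumptions, not the exact design used in the experiments.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class SubModel:
    # Placeholder for one modality's encoder + classifier f_m(. ; w_k^(m)).
    def __init__(self, in_dim, n_classes, rng):
        self.W = rng.normal(size=(in_dim, n_classes)) * 0.01  # stands in for w_k^(m)

    def logits(self, x):
        return x @ self.W  # o_ki^(m) = f_m(X_ki^(m), w_k^(m))

def late_fusion_predict(sample, submodels):
    # Psi(.) as equal-weight averaging over the modalities present in `sample`;
    # missing modalities are simply absent from the dict and thus ignored.
    probs = [softmax(submodels[m].logits(x)) for m, x in sample.items()]
    return int(np.argmax(np.mean(probs, axis=0)))  # predicted label Y~_ki

rng = np.random.default_rng(0)
submodels = {m: SubModel(in_dim=16, n_classes=9, rng=rng) for m in range(3)}
sample = {0: rng.normal(size=16), 2: rng.normal(size=16)}  # modality 1 missing
print(late_fusion_predict(sample, submodels))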

FIGURE 2. Late-fusion network architecture of the global model, where each modality is associated with a sub-model. Users possess sub-models corresponding to their available data modalities.

4) Loss Function

The global objective in MFL is to minimize the weighted sum of loss values over the local datasets without direct access to the private data. In this regard, the following optimization problem is solved:\begin{equation*} \underset {\boldsymbol {W}({\mathcal {M}})}{\min } \; \; {\mathcal {L}}_{G}(\boldsymbol {W}({\mathcal {M}})) = \sum _{k \in \mathbb {U}}\pi _{k}{\mathcal {L}}_{k}({\mathcal {D}}_{k},\boldsymbol {W}({\mathcal {M}}_{k})), \tag {1}\end{equation*}

where \boldsymbol {W}({\mathcal {M}}) = [\boldsymbol {\omega }^{(m)}]_{m\in {\mathcal {M}}} and \boldsymbol {W}({\mathcal {M}}_{k}) = [\boldsymbol {\omega }^{(m)}]_{m\in {\mathcal {M}}_{k}} contain the weight parameters corresponding to the data modalities in {\mathcal {M}} and {\mathcal {M}}_{k} , respectively. In addition, \pi _{k} is the importance weight of user k, and \sum _{k\in \mathbb {U}}\pi _{k} = 1 . The local objective of each user k in the MFL is to find the weight parameters \boldsymbol {W}_{k} = [\boldsymbol {\omega }_{k}^{(m)}]_{m\in {\mathcal {M}}_{k}} that minimize the loss over its local data as follows:\begin{align*} \underset {\boldsymbol {W}_{k}}{\min } \; \; {\mathcal {L}}_{k}({\mathcal {D}}_{k},\boldsymbol {W}({\mathcal {M}}_{k})) = \sum _{i = 1}^{|{\mathcal {D}}_{k}|} \ell \Big (F_{k}\big ({\mathcal {X}}_{ki},\boldsymbol {W}({\mathcal {M}}_{k})\big ),{\mathcal {Y}}_{ki}\Big ), \tag {2}\end{align*}

where F_{k}\big ({\mathcal {X}}_{ki},\boldsymbol {W}({\mathcal {M}}_{k}) \big) = \tilde {{\mathcal {Y}}}_{ki} is the output of the local multi-modal model of user k for input {\mathcal {X}}_{ki} . In addition, \ell (\tilde {{\mathcal {Y}}}_{ki},{\mathcal {Y}}_{ki}) measures the distance between the predicted label \tilde {{\mathcal {Y}}}_{ki} and the actual label {\mathcal {Y}}_{ki} of the i^{\text {th}} sample at user k.

At each communication round r, the global weight parameters of the multi-modal model are updated at the central server as follows:\begin{align*} \boldsymbol {W}^{r}({\mathcal {M}}) = \boldsymbol {W}^{r-1}({\mathcal {M}}) + \sum _{k\in \mathbb {U}}\begin{bmatrix} \boldsymbol {1} \times p_{k,1} \\ \vdots \\ \boldsymbol {1} \times p_{k,M} \end{bmatrix} \odot \boldsymbol {G}^{r}_{k}({\mathcal {M}}), \tag {3}\end{align*}

where \boldsymbol {1} is a vector of ones and p_{k,m} is the importance weight of modality m at user k, such that \sum _{k\in \mathbb {U}}p_{k,m}=1, \; \forall m\in {\mathcal {M}} . In addition, \odot denotes the Hadamard product, and \boldsymbol {G}^{r}_{k}({\mathcal {M}}) = [\vec {\boldsymbol {g}}^{(m),r}_{k}]_{m\in {\mathcal {M}}} represents the local gradients of all modalities at user k in communication round r, where \vec {\boldsymbol {g}}^{(m),r}_{k} = \nabla _{m} {\mathcal {L}}_{k}({\mathcal {D}}^{(m)}_{k},\boldsymbol {W}^{r}({\mathcal {M}}_{k})) denotes the local gradient of modality m at user k.
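As a concrete sketch of the modality-wise aggregation in (3), the snippet below assumes local gradients are stored per user as a dictionary keyed by modality, and that the importance weights p_{k,m} are uniform over the users reporting modality m in the round (both are illustrative assumptions).

import numpy as np

def aggregate(global_weights, local_grads):
    # global_weights: {m: ndarray}; local_grads: {k: {m: ndarray}} for round r.
    # Implements W^r(m) = W^{r-1}(m) + sum_k p_{k,m} g_k^{(m),r} per modality.
    for m in global_weights:
        contributors = [k for k in local_grads if m in local_grads[k]]
        if not contributors:
            continue  # no user updated modality m this round
        p = 1.0 / len(contributors)  # uniform weights, sum_k p_{k,m} = 1
        global_weights[m] = global_weights[m] + p * sum(
            local_grads[k][m] for k in contributors)
    return global_weights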

Considering the above-mentioned framework, we focus on the problem of MFL over resource-limited CF-mMIMO networks. In this context, we introduce the communication and computation models in the following subsection.

B. Overview of CF-mMIMO Systems

Let us assume that multiple APs, indexed by \mathbb {A} = \{1,\ldots ,P\} , each with N transmitting antennas, are distributed within the network and communicate with the users. There is a high-speed backhaul connection between the APs and a central control unit (CU), as shown in Fig. 1. We assume that the CU has high communication and computation capabilities and is responsible for model aggregation in the FL system. Due to privacy concerns, users are reluctant to share their data with other parties (i.e., other users and companies); however, sharing data among the multiple devices of a single user is allowed. In this regard, the private zone of user k, denoted by z_{k} , is defined as the set of all devices belonging to user k. Accordingly, we divide the communications in the network into a) intra-zone, i.e., communications within a private zone, and b) inter-zone, i.e., communications across multiple private zones.

1) Intra-Zone Communication

Devices belonging to user k are allowed to share raw data within the private zone z_{k} as it does not violate privacy of the user. Therefore, smart devices can benefit from device-to-device communications to collaboratively learn a model while taking into account the available communication and computation resources on the devices.

2) Inter-Zone Communication

Devices within a private zone communicate with other parties in the network to train a global model based on the FL algorithm. In this regard, we consider a user-centric setting where each user k connects to a set of APs, denoted as {\mathcal {A}}(k) \subset \mathbb {A} . In addition, each AP a serves a set of users {\mathcal {U}}(a)\subset \mathbb {U} . Accordingly, the received local updates are collected at the CU for decoding and aggregation.

The device-AP association in CF-mMIMO networks can be based on various metrics, e.g., the distance between devices and APs or the quality of the channel between them. We consider a Rayleigh fading channel model between APs and devices. Accordingly, the channel between device k and AP a is generated as \boldsymbol {h}_{ka} \sim {\mathcal {CN}}(\boldsymbol {0}_{N},\boldsymbol {R}_{ka}) . In this model, the Gaussian distribution represents the small-scale fading, while the correlation matrix models the large-scale path loss. Particularly, the set of APs serving user k can be obtained as follows\begin{equation*} {\mathcal {A}}(k) = \{a \in \mathbb {A} \; \; \text {s.t.} \; \; ||\boldsymbol {h}_{ka}||_{2} \geq h^{\text {thr}} \}, \tag {4}\end{equation*}

where h^{\text {thr}} is a threshold value for the channel gain.
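A minimal simulation sketch of this association rule, assuming a diagonal correlation matrix \boldsymbol {R}_{ka} = \beta _{ka}\boldsymbol {I}_{N} with a toy log-distance path loss; the geometry, path-loss exponent, and threshold value are arbitrary placeholders rather than the simulation parameters of Section V.

import numpy as np

rng = np.random.default_rng(1)
N, P, K = 4, 16, 8                       # antennas per AP, APs, users
ap_xy = rng.uniform(0, 300, size=(P, 2))
ue_xy = rng.uniform(0, 300, size=(K, 2))

def channel(k, a):
    # h_ka ~ CN(0, beta_ka * I): Rayleigh small-scale fading scaled by path loss
    d = np.linalg.norm(ue_xy[k] - ap_xy[a]) + 1.0
    beta = d ** -3.5                     # assumed path-loss exponent
    return np.sqrt(beta / 2) * (rng.normal(size=N) + 1j * rng.normal(size=N))

h_thr = 1e-4                             # arbitrary channel-gain threshold
# Eq. (4): AP a serves user k if the channel norm exceeds the threshold
A = {k: [a for a in range(P) if np.linalg.norm(channel(k, a)) >= h_thr]
     for k in range(K)}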

3) Delay Model

Data modalities may produce heterogeneous delays due to their specific characteristics, the employed network architecture, and the amount of computations required for local updates. The delay model of each user k at communication round r is given as\begin{equation*} t_{k}^{r} = d_{k}^{\text {ul},r} + d_{k}^{\text {comp},r} + d_{k}^{\text {dl},r}, k = 1,\ldots ,K, \tag {5}\end{equation*}

where the three terms represent uplink transmission, computations, and downlink transmission delays, respectively. Note that we assume perfect backhaul communication between APs and the CU, such as via fiber optics. Consequently, backhaul delay is neglected in our model.

a: Uplink Transmission

We employ Non-Orthogonal Multiple Access (NOMA) for the uplink transmission mode, as it allows participating devices to transmit within the same frequency-time blocks. Let x_{k} and p_{k} denote the data signal and transmit power of user k, where \mathbb {E}\{x_{i} x_{j}^{*}\} = \delta _{ij} . The received signal at AP a\in \mathbb {A} can be written as \boldsymbol {y}_{a} = \sum _{k = 1 }^{K_{s}^{r}}\sqrt {p_{k}}x_{k} \boldsymbol {h}_{ka} + \boldsymbol {n}_{a} , where K_{s}^{r} \leq K is the number of selected users at communication round r (K_{s}^{r} = |\mathbb {U}_{s}^{r}| ). In addition, \boldsymbol {n}_{a}\in \mathbb {C}^{N \times 1} is the additive white Gaussian noise at AP a. The APs forward the received signals to the CU using a high-speed connection whose delay is ignored in our work. In this regard, the received signal at the CU, denoted as \boldsymbol {y}\in \mathbb {C}^{NP \times 1} , is given as \boldsymbol {y} = \sum _{k = 1}^{K_{s}^{r}}{\sqrt {p_{k}}}x_{k}\boldsymbol {h}_{k} + \boldsymbol {n} , where \boldsymbol {h}_{k} = [\boldsymbol {h}^{T}_{k1},\ldots ,\boldsymbol {h}^{T}_{kP}]^{T} \in \mathbb {C}^{NP \times 1} is the collective channel of user k, i.e., from the user to all APs. Note that while each user k connects to only a set of APs {\mathcal {A}}(k) , its transmitted signal will be received at other APs as interference. Accordingly, to reduce the required amount of computations, the APs only estimate the channel with the users they are serving. In this regard, we define the estimated channel between user k and AP a as \tilde {\boldsymbol {h}}_{ka} = \boldsymbol {D}_{ka}\boldsymbol {h}_{ka} , where \boldsymbol {D}_{ka} is the identity matrix \boldsymbol {I}_{N \times N} if user k is served by AP a, and \boldsymbol {0}_{N \times N} otherwise.

The CU employs the Successive Interference Cancellation (SIC) algorithm to decode the received model updates x_{k} from each participating user k \in \mathbb {U}_{s}^{r} . To this end, the decoding process starts with the highest-power signal, treating the lower-power signals as interference. Subsequently, the receiver removes the decoded high-power signal from the received signal to facilitate decoding of the lower-power signals. Accordingly, the achievable data rate for each user k\in \mathbb {U}_{s}^{r} is given as r_{k}^{\text {ul}} = B \log \left ({{1 + \frac {p_{k}||\boldsymbol {h}_{k}||^{2}}{{\mathcal {N}}_{k} + {\mathcal {I}}_{k}}}}\right) , where B and {\mathcal {N}}_{k} denote the available bandwidth for uplink transmission and the noise power at user k, respectively. In addition, {\mathcal {I}}_{k} is the inter-user interference after SIC for user k, obtained as {\mathcal {I}}_{k} = \sum _{i = k+1}^{K_{s}^{r}}p_{i}||\boldsymbol {h}_{i}||^{2} . Accordingly, the uplink transmission delay of user k is given as\begin{equation*} d_{k}^{\text {ul}} = \frac {\sum _{m=1}^{M} g_{k,m}v^{r}_{k,m}}{r_{k}^{\text {ul}}}, \tag {6}\end{equation*}

where g_{k,m} is the size of the gradient of sub-model m at user k, and v^{r}_{k,m} is a binary indicator for selecting modality m at user k in round r.
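A sketch of the SIC rate computation and the uplink delay in (6) follows; base-2 logarithms and the container layouts (lists indexed by user and modality) are assumptions for illustration.

import numpy as np

def uplink_rates(p, h_norm2, B, noise):
    # SIC order: strongest received power decoded first; user k then sees
    # residual interference I_k from the users decoded after it.
    order = np.argsort(-(p * h_norm2))
    rates = np.empty_like(p)
    for idx, k in enumerate(order):
        interf = sum(p[i] * h_norm2[i] for i in order[idx + 1:])
        rates[k] = B * np.log2(1 + p[k] * h_norm2[k] / (noise + interf))
    return rates

def uplink_delay(k, rates, g, v):
    # Eq. (6): d_k^ul = sum_m g_{k,m} v_{k,m} / r_k^ul
    return sum(g[k][m] * v[k][m] for m in range(len(g[k]))) / rates[k]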

b: Computations Delay

Denoting C_{k,m} as the number of CPU cycles required per bit at user k for updating sub-model m, the corresponding computation delay is given as\begin{equation*} d_{k}^{\text {comp}} = \frac {\sum _{m=1}^{M} C_{k,m}\psi _{k,m}v^{r}_{k,m}}{f^{r}_{k}}, \tag {7}\end{equation*}

where \psi _{k,m} and f^{r}_{k} represent the number of bits in the local update of sub-model m at user k and the processing frequency of user k in communication round r, respectively.

c: Downlink Transmission

In the downlink transmission, i.e., stage (S5), where the APs send the updated models to the users, we employ an Orthogonal Multiple Access (OMA) scheme. Accordingly, sub-models associated with different data modalities are assigned orthogonal resources. In this regard, OMA ensures that the transmitted downlink messages are tailored to specific modalities, so that users with identical data modalities receive the same models. Consequently, the downlink transmission load remains unchanged as the number of selected users K_{s}^{r} increases. Denoting B_{m}^{\text {dl}} and p_{a}^{m} as the bandwidth and transmission power allocated to sub-model m, the downlink transmission rate and delay are obtained similarly to the uplink transmission equations.

d: Energy Consumption

The energy consumption at each user k \in \mathbb {U} comprises the energy consumption for local computations and uplink transmission, formulated as\begin{equation*} e_{k}^{r} = e_{k}^{r,\text {comp}} + e_{k}^{r,\text {ul}}, \tag {8}\end{equation*}

where e_{k}^{r,\text {ul}} = p_{k}^{r} \times d_{k}^{r,\text {ul}} denotes the energy consumption of uplink transmission. In addition, letting \epsilon _{k} represent the energy consumption of one CPU cycle at user k, the computation energy consumption depends on the selected data modalities and is calculated as\begin{equation*} e_{k}^{r,\text {comp}} = \epsilon _{k}(f_{k}^{r})^{2} \sum _{m=1}^{M} C_{k,m}\psi _{k,m}v^{r}_{k,m}. \tag {9}\end{equation*}
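The per-round computation delay (7) and the energy budget (8)-(9) reduce to simple sums over the selected modalities; a minimal Python sketch with hypothetical container layouts:

def comp_delay(k, C, psi, v, f):
    # Eq. (7): d_k^comp = sum_m C_{k,m} psi_{k,m} v_{k,m} / f_k
    return sum(C[k][m] * psi[k][m] * v[k][m] for m in range(len(C[k]))) / f[k]

def round_energy(k, p, d_ul, eps, f, C, psi, v):
    # Eqs. (8)-(9): e_k = eps_k f_k^2 sum_m C psi v + p_k d_k^ul
    e_comp = eps[k] * f[k] ** 2 * sum(
        C[k][m] * psi[k][m] * v[k][m] for m in range(len(C[k])))
    return e_comp + p[k] * d_ul[k]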

SECTION IV.

Device-Modality Selection and Resource Allocation for MFL in CF-mMIMO Systems

In this section, we first formulate the decision-making problem in MFL over resource-limited CF-mMIMO networks, corresponding to the (S1) stage of the MFL framework outlined in Section III-A.1. To address this problem and improve the performance of MFL, we subsequently propose a device-modality selection and resource allocation algorithm.

A. Problem Formulation

To train the multi-modal network structure in Fig. 2, we break it into M sub-models. Accordingly, each device k takes part in training its available data modalities m \in {\mathcal {M}}_{k} . Due to the specific characteristics of the data modalities and their corresponding sub-models, each modality may have a different convergence speed, requiring a different number of communication rounds. To this end, at each communication round we opt to select the data modalities and participating users that make a higher contribution to the multi-modal network. Note that based on the capacity of the communication channel, multiple sub-models may be selected in each round.

In the proposed MFL framework, employing the information collected from the network, the server selects participating users, assigns data modalities for updates, and allocates communication resources for uplink transmission. In this regard, after completion of each round of MFL, the available sub-models at each user k are fused to construct local multi-modal networks. In the MFL process, the M sub-models are trained simultaneously considering the communication and computation limitations. Let T_{r} represent the execution time of MFL in communication round r. The overall delay of the MFL process is given by:\begin{equation*} T(\boldsymbol {S}, R, \boldsymbol {A}) = \sum _{r \in \mathbb {R}} T_{r}(\boldsymbol {s}_{r}, \boldsymbol {a}_{r}), \tag {10}\end{equation*}

where \boldsymbol {s}_{r} and \boldsymbol {a}_{r} denote the system state and decision vector in communication round r, respectively. Here, \boldsymbol {S} = [\boldsymbol {s}_{1}, \ldots , \boldsymbol {s}_{R}] and \boldsymbol {A} = [\boldsymbol {a}_{1}, \ldots , \boldsymbol {a}_{R}] represent the state and decision vectors for all R rounds. The state \boldsymbol {s}_{r} depends on the random channel conditions, remaining energy, and available computation capabilities of all users in round r, introducing uncertainty. In addition, let us denote \boldsymbol {v}^{r} = [\boldsymbol {v}^{r}_{1},\ldots ,\boldsymbol {v}^{r}_{K}] , where \boldsymbol {v}^{r}_{k} = [v^{r}_{k,1},\ldots ,v^{r}_{k,M}] , as the selection vector, \boldsymbol {p}^{r} = [p^{r}_{1},\ldots ,p^{r}_{K}] as the uplink power allocation vector, and \boldsymbol {f}^{r} = [f^{r}_{1},\ldots ,f^{r}_{K}] as the vector of processing frequencies at communication round r. The action at round r can be written as \boldsymbol {a}_{r} = [\boldsymbol {v}^{r},\boldsymbol {p}^{r}, \boldsymbol {f}^{r}] . The execution time of the FL process in each communication round r is determined by the slowest participating user as follows:\begin{equation*} T_{r}(\boldsymbol {s}_{r},\boldsymbol {a}_{r}) = \underset {k}{\max } \; \{ \nu ^{r}_{k}t_{k}^{r} \}, \tag {11}\end{equation*}
where \nu ^{r}_{k} = \text {sign}\left ({{\sum _{m = 1}^{M}v^{r}_{k,m}}}\right) is an indicator for selecting user k and t_{k}^{r} is the delay of user k at round r.

The goal of MFL is to train a model that minimizes the global loss function {\mathcal {L}}_{G}(\boldsymbol {W}^{(R)}({\mathcal {M}})) at the end of the MFL process. Hence, with a long-term perspective, we consider the following objective function:\begin{equation*} {\mathcal {L}}(\boldsymbol {S},R,\boldsymbol {A}) = {\mathcal {L}}_{G}(\boldsymbol {W}^{(R)}({\mathcal {M}})). \tag {12}\end{equation*}

Note that the above objective function depends on the selected users and modalities as well as the local gradients in all communication rounds r \in \mathbb {R} .

Considering the objective functions in (10) and (12), the following optimization problem is defined for performance improvement in MFL:\begin{align*} & \boldsymbol {{\mathcal {P}}}1) \; \; \underset {\boldsymbol {A}}{\min }\quad \{ T(\boldsymbol {S},R,\boldsymbol {A}), {\mathcal {L}}(\boldsymbol {S},R,\boldsymbol {A})\} \tag {13}\\ & \hspace {0.8cm}\textbf {s.t.} \; \; \mathbf {C1.} \ \ v_{k,m}^{r}\in \{0,1\}, \; \forall k \in \mathbb {U}, \; \forall m \in {\mathcal {M}}, \; \forall r\in \mathbb {R}, \tag {14}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }\mathbf {C2.} \ \ t^{r}_{k} \leq \tau _{\text {th}}, \; \forall k \in \mathbb {U}_{s}^{r}, \; \forall r \in \mathbb {R}, \tag {15}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }\mathbf {C3.} \ \ e_{k}^{r} \leq e_{k}^{\max }, \; \forall k \in \mathbb {U}_{s}^{r}, \; \forall r \in \mathbb {R}, \tag {16}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }\mathbf {C4.} \ \ 0 \leq p_{k}^{r} \leq p_{k}^{\max }, \; \forall k \in \mathbb {U}_{s}^{r}, \; \forall r \in \mathbb {R}, \tag {17}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }{\mathbf {C5.} \ \ f_{k}^{\min } \leq f_{k}^{r} \leq f_{k}^{\max }, \; \forall k \in \mathbb {U}_{s}^{r}, \; \forall r \in \mathbb {R}}, \tag {18}\\ & \hphantom { \hspace {0.8cm}\textbf {s.t.} \; \; }\mathbf {C6.} \ \ \sum _{\forall k \in \mathbb {U}}\nu ^{r}_{k} \geq N^{\text {QoL}}, \; \forall r \in \mathbb {R}, \tag {19}\end{align*}

where C2 indicates the delay constraint for the current communication round, while C3 guarantees that the energy consumption of none of the smart devices exceeds its limit. In addition, constraint C4 determines the lower and upper bounds for the transmission power of each user. Furthermore, constraint C5 guarantees that the CPU frequency of each device remains within the range [f_{k}^{\min },f_{k}^{\max }] in each communication round. Note that the delay, energy consumption, and transmission power at round r depend on the state of the system, which may change multiple times during that round due to small-scale fading. Furthermore, constraint C6 guarantees that a minimum number of users, denoted by N^{\text {QoL}} , participate in each communication round, ensuring the convergence of the MFL. Note that the two objective functions in problem \boldsymbol {{\mathcal {P}}}1 are not independent of each other: reducing the global loss requires more participants in the MFL process, resulting in larger delay.

B. Proposed Solution

Achieving an optimal solution for optimization problem \boldsymbol {{\mathcal {P}}}1 requires access to the full information of the system state \boldsymbol {s}_{r} in each communication round r\in \mathbb {R} . However, the Channel State Information (CSI) between users and APs changes after each small-scale coherence interval and is not available for future rounds. In addition, the loss function {\mathcal {L}}_{G}(\boldsymbol {W}^{(R)}({\mathcal {M}})) is sequentially influenced by the selected participants (i.e., device-modality pairs) and local updates in the previous rounds r \leq R . Therefore, we focus on each communication round for decision-making in MFL, and break the optimization problem \boldsymbol {{\mathcal {P}}}1 into two sub-problems per round, for which we propose corresponding solutions: device-modality selection and resource allocation.

1) Proposed Device-Modality Selection

Recent works [32], [33] have shown that the participation of a greater number of users can improve the convergence speed of the FL process. However, due to limited communication resources in CF-mMIMO networks, only a subset of devices can be selected in each communication round, since a large number of participants leads to significant delays (i.e., the straggler effect [34]). In addition, device-modality pairs contribute differently to training the multi-modal model in Fig. 2 due to system and data heterogeneity. Therefore, it is imperative to prioritize devices and modalities that make the most contribution to the learning process. One approach is to consider a Single-class Queue (SingleQ), where all device-modality pairs are placed in a single queue and prioritized based on a unified metric. In this case, the global loss function can be estimated as \tilde {{\mathcal {L}}}(\boldsymbol {S}_{r},\boldsymbol {a}_{r}) \; = \; -\sum _{k = 1}^{K}\sum _{m = 1}^{M} \gamma ^{r}_{k,m}v^{r}_{k,m} , where \gamma ^{r}_{k,m} is the unified metric for modality m and user k. Although this approach simplifies the selection to typical unimodal user selection schemes, finding a unified metric that can indicate the importance of different modalities is challenging due to data heterogeneity.

In our proposed selection scheme, device-modality pairs are modeled as a Multi-class Queue (MultiQ), where priority is given based on modality-specific metrics in each queue. Accordingly, let {\mathcal {Q}}_{m} = \{(1_{m},m),\ldots ,(K_{m},m)\} denote the priority queue for modality m, where \gamma ^{r}_{1_{m},m}\leq \gamma ^{r}_{2_{m},m} \leq \ldots \leq \gamma ^{r}_{K_{m},m} . In this case, we define the following multi-objective optimization:\begin{align*} & \hspace {-.5cm}\boldsymbol {{\mathcal {P}}}^{\text {DMS}}_{\text {MultiQ}}) \; \; \underset {\boldsymbol {v}^{r}}{\min }\quad L_{1},\ldots ,L_{M} \tag {20}\\ & \hspace {0.9cm}\textbf {s.t.} \; \; \mathbf {C1} \sim \mathbf {C6} \\ & \hphantom {\hspace {0.9cm}\textbf {s.t.} \; \; }\mathbf {C7)} \; \sum _{k = 1}^{K} v^{r}_{k,m} \geq N_{m}^{\text {QoL}}, \; \forall m \in {\mathcal {M}} \tag {21}\end{align*}

where L_{m}=-\sum _{k \in \mathbb {U}}\zeta _{k,m}v^{r}_{k,m} is the objective function corresponding to modality m, with \zeta _{k,m} denoting the importance weight of modality m from user k and \sum _{k\in \mathbb {U}}\zeta _{k,m} = 1 . In addition, constraint C7 ensures that a minimum of N^{\text {QoL}}_{m} devices is chosen for each modality m \in {\mathcal {M}} . We propose a greedy algorithm to solve the above optimization problem, summarized in Algorithm 1. At each iteration of the algorithm, a device-modality pair is added to the selection set \mathbb {U}_{s}^{r} . During the selection process, we assume a worst-case scenario where users employ their maximum transmission power, i.e., maximum energy consumption and interference across users. Accordingly, if constraints C2 and C3 hold with the added device-modality pair, the pair is selected for communication round r, i.e., v^{r}_{k,m} = 1 ; otherwise, modality m at user k is ignored for the current round. Additionally, in each iteration, similar to the Round Robin algorithm, device-modality pairs are added to the selection set. Unlike Round Robin, the selection from modality m is paused if \sum _{k\in \mathbb {U}}v^{r}_{k,m} \geq N_{m}^{\text {QoL}} while \exists m'\in {\mathcal {M}} \; \text {s.t.} \; \sum _{k\in \mathbb {U}}v^{r}_{k,m'} \lt N_{m'}^{\text {QoL}} .

Algorithm 1 Proposed MultiQ Algorithm for Device-Modality Selection

Input: The set of all users \mathbb {U} , communication round r, importance weights \gamma ^{r}_{k,m} , state vector \boldsymbol {s}_{r} .

1: {\mathcal {Q}}_{m}^{r} \leftarrow \varnothing , \; \forall m \in {\mathcal {M}}
2: v^{r}_{k,m} \leftarrow 0, \; \forall k\in \mathbb {U}, \; \forall m \in {\mathcal {M}}
3: Construct {\mathcal {Q}}^{r}_{m}, \; \forall m \in {\mathcal {M}} , by sorting (k,m) pairs \forall k \in \mathbb {U} using HeapSort [31] based on \gamma ^{r}_{k,m}
4: for m \in {\mathcal {M}} s.t. \sum _{k\in \mathbb {U}_{s}^{r}} v^{r}_{k,m} \leq N_{m}^{\text {QoL}} do
5:  (k^{*},m^{*}) \leftarrow \text {Dequeue}({\mathcal {Q}}^{r}_{m})
6:  v^{r}_{k^{*},m^{*}} \leftarrow \text {Select-Check}(k^{*},m^{*})
7: end for
8: while \exists m\in {\mathcal {M}} \; \text {s.t.} \; {\mathcal {Q}}_{m}^{r} \neq \varnothing do
9:  (k^{*},m^{*}) \leftarrow \text {Dequeue}({\mathcal {Q}}_{m}^{r})
10:  v^{r}_{k^{*},m^{*}} \leftarrow \text {Select-Check}(k^{*},m^{*})
11: end while
12: function Select-Check(k,m)
13:  v^{r}_{k,m} \leftarrow 1
14:  p_{k} \leftarrow p_{k}^{\max }
15:  Update d_{\ell }^{\text {ul}}, \; \forall \ell \in \mathbb {U}^{r}_{s} , according to (6)
16:  Compute d_{k}^{\text {comp}} according to (7)
17:  Compute d_{k}^{\text {dl}} similar to (6)
18:  if t_{\ell }^{r} \gt \tau _{\text {th}} or e_{\ell }^{r} \gt e_{\ell }^{\max } for any \ell \in \mathbb {U}^{r}_{s} then
19:   v^{r}_{k,m} \leftarrow 0
20:  end if
21:  return v^{r}_{k,m}

Output: The selected device-modality vectors \boldsymbol {v}^{r}_{k}, \; \forall k \in \mathbb {U} .
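A compact Python rendering of Algorithm 1 is sketched below; the joint delay-energy feasibility check of steps 13-20 is abstracted into a user-supplied callable, `heapq` plays the role of HeapSort, and a while loop is used for the QoL phase so that multi-device minimums are reached (an interpretation of the for loop in steps 4-7).

import heapq

def multiq_select(users, modalities, gamma, n_qol, feasible):
    # gamma[(k, m)]: priority metric; n_qol[m]: minimum selections per modality;
    # feasible(selected, cand): True if C2/C3 still hold with `cand` added,
    # evaluated in the worst case where users transmit at p_k^max.
    queues = {m: [(-gamma[(k, m)], k) for k in users if (k, m) in gamma]
              for m in modalities}
    for q in queues.values():
        heapq.heapify(q)                      # highest-gamma pair dequeued first
    selected, count = set(), {m: 0 for m in modalities}

    def try_pick(m):
        _, k = heapq.heappop(queues[m])
        if feasible(selected, (k, m)):        # Select-Check at maximum power
            selected.add((k, m))
            count[m] += 1

    # Phase 1: satisfy the per-modality minimums (constraint C7)
    while any(count[m] < n_qol[m] and queues[m] for m in modalities):
        for m in modalities:
            if count[m] < n_qol[m] and queues[m]:
                try_pick(m)
    # Phase 2: round-robin over the remaining pairs
    while any(queues.values()):
        for m in modalities:
            if queues[m]:
                try_pick(m)
    return selected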

2) Proposed Resource Allocation

Although the selected device-modality pairs are fixed during a communication round, the communication channel may change multiple times in that period; hence, the resources need to be tuned accordingly. Given the device-modality selection from the first sub-problem, the resource allocation optimization problem at communication round r can be rewritten as follows:\begin{align*} & \hspace {-.5cm}\boldsymbol {{\mathcal {P}}}^{\text {RA}})\qquad \; \underset {\boldsymbol {p},\boldsymbol {f}}{\min }\quad T_{r}(\boldsymbol {s}_{r},\boldsymbol {a}_{r}) = \underset {k}{\max }\{ \nu ^{r}_{k} t^{r}_{k}\} \tag {22}\\ & \hspace {0.9cm}\textbf {s.t.} \; \; \mathbf {C2} \sim \mathbf {C5}.\end{align*}


Compared to the initial problem \boldsymbol {{\mathcal {P}}}1 , the integer variables for device-modality selection (i.e., \boldsymbol {v}^{r} ) are known, hence constraints C1 and C6 are already satisfied. The resource allocation problem in \boldsymbol {{\mathcal {P}}}^{\text {RA}} is non-convex and cannot be solved using standard convex optimization techniques. To solve this problem, we use the PSO algorithm, an iterative algorithm inspired by the collective behavior of bird flocks and fish schools. It considers a swarm of J elements, referred to as agents or particles, where each agent j represents a candidate solution moving in the feasible search space bounded by the constraints of the optimization problem. Without loss of generality, we set \boldsymbol {\chi } = [\boldsymbol {P},\boldsymbol {F}] . In addition, each particle j is assigned a position and velocity denoted by \boldsymbol {\chi }^{(j)} = [p^{(j)}_{1},\ldots ,p^{(j)}_{K_{s}},f^{(j)}_{1},\ldots ,f^{(j)}_{K_{s}}] and \boldsymbol {\delta }^{(j)} = [\delta ^{(j)}_{1},\ldots ,\delta ^{(j)}_{2K_{s}}] , respectively. The vectors \boldsymbol {\chi } and \boldsymbol {\delta } are randomly initialized at the beginning of the algorithm. Then, the position and velocity of the particles are updated iteratively according to the current best position of each particle and of the swarm as follows:\begin{align*} \boldsymbol {\chi }^{(j)}(t+1) & = \boldsymbol {\chi }^{(j)}(t) + \boldsymbol {\delta }^{(j)}(t+1) \tag {23}\\ \boldsymbol {\delta }^{(j)}(t+1) & = \kappa \boldsymbol {\delta }^{(j)}(t) + r_{1}c_{1}\big (\boldsymbol {\chi }^{(j)}_{\text {best}}(t) - \boldsymbol {\chi }^{(j)}(t)\big ) \\ & \quad + r_{2}c_{2}\big (\boldsymbol {\chi }^{\text {swarm}}_{\text {best}}(t) - \boldsymbol {\chi }^{(j)}(t)\big ), \tag {24}\end{align*}

where t is the index of the current iteration. Additionally, \boldsymbol {\chi }^{(j)}_{\text {best}}(t) and \boldsymbol {\chi }^{\text {swarm}}_{\text {best}}(t) represent the current best positions of particle j and of the swarm, respectively. Furthermore, \kappa is the inertia coefficient for adjusting the convergence speed, c_{1} \geq 0 and c_{2} \geq 0 are acceleration coefficients, and r_{1} and r_{2} are random real values in the range [0,1] . In each iteration t of the proposed PSO algorithm, the current best position of particle j and the current best position of the swarm are updated as\begin{align*} \boldsymbol {\chi }^{(j)}_{\text {best}}(t)& = \underset {\boldsymbol {\chi } = \boldsymbol {\chi }^{(j)}(t'), \; t' = 0,\ldots ,t}{\text {arg min}} \; {\mathcal {C}}(\boldsymbol {\chi }), \tag {25}\\ \boldsymbol {\chi }^{\text {swarm}}_{\text {best}}(t)& = \underset {\boldsymbol {\chi } = \boldsymbol {\chi }^{(j)}_{\text {best}}(t), \; \forall j = 1,\ldots ,J }{\text {arg min}} \; {\mathcal {C}}(\boldsymbol {\chi }), \tag {26}\end{align*}
View SourceRight-click on figure for MathML and additional features.
where {\mathcal {C}}(\boldsymbol {\chi }) is the cost function of the PSO algorithm at position \boldsymbol {\chi } . To ensure constraint satisfaction in the PSO algorithm, we convert constraints C2-C5 into penalties added to the objective function. To this end, we introduce the indicator function I(z) = -\frac {\log (-z)}{\varrho } , where \varrho is a positive parameter for scaling the penalty. The function I(z) returns negative values when z \lt 0 , i.e., when the constraint is satisfied, and grows to infinity as z approaches 0. Accordingly, we define the cost function as in (27), shown at the bottom of the page,\begin{equation*} {\mathcal {C}}(\boldsymbol {\chi }) = T_{r}(\boldsymbol {s}_{r},\boldsymbol {a}_{r})+ \sum _{k\in \mathbb {U}_{s}} \Big ( I(t_{k}^{r}-\tau ^{\text {th}}) + I(e_{k}^{r}-e_{k}^{\max }) + I(p_{k}^{r}-p_{k}^{\max }) + I(-p_{k}^{r}) + I(f_{k}^{r}-f_{k}^{\max }) + I(f_{k}^{\min } - f_{k}^{r}) \Big ) \tag {27}\end{equation*}
View SourceRight-click on figure for MathML and additional features.
for the PSO algorithm, where the penalty terms correspond to constraints C2-C5. A summary of the proposed PSO algorithm for solving \boldsymbol {{\mathcal {P}}}^{\text {RA}} is shown in Algorithm 2.

Algorithm 2 Proposed PSO Algorithm for Resource Allocation in MFL

Input: \boldsymbol {s}_{r} , \boldsymbol {v}^{r} , \mathbb {U}_{s}^{r} , \kappa , c_{1} , c_{2} , r_{1} , r_{2} .

1: Set velocities \boldsymbol {\delta }^{(j)}(0) = 0, \; \forall j = 1,\ldots ,J
2: Set starting positions \boldsymbol {\chi }^{(j)} = [\boldsymbol {p}^{(j)},\boldsymbol {f}^{(j)}], \; \forall j = 1,\ldots ,J , randomly with 0 \leq p_{k}^{(j)} \leq p_{k}^{\max } and f_{k}^{\min } \leq f_{k}^{(j)} \leq f_{k}^{\max }, \; \forall k \in \mathbb {U}_{s}^{r}
3: for i = 1:itr do
4:  for j = 1:J do
5:   Update velocity using (24)
6:   Update position using (23)
7:  end for
8:  Update t_{k}^{r}, \; \forall k \in \mathbb {U}_{s}^{r} based on (5)
9:  Update e_{k}^{r}, \; \forall k \in \mathbb {U}_{s}^{r} based on (8)
10:  Update local best positions \boldsymbol {\chi }^{(j)}_{\text {best}}(t), \; \forall j = 1,\ldots ,J , according to (25)
11:  Update global best position \boldsymbol {\chi }^{\text {swarm}}_{\text {best}}(t) according to (26)
12: end for

Output: The transmit power p^{r}_{k} and processing frequency f_{k}^{r}, \; \forall k \in \mathbb {U}_{s}^{r} .
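A self-contained Python sketch of the PSO loop under the update rules (23)-(26) is given below; box constraints C4-C5 are enforced by clipping, C2-C3 enter through the user-supplied cost implementing (27), and all parameter defaults are illustrative.

import numpy as np

def barrier(z, rho=10.0):
    # Log-barrier I(z) = -log(-z)/rho; finite for z < 0, infinite otherwise.
    return -np.log(-z) / rho if z < 0 else np.inf

def pso_allocate(cost, p_max, f_min, f_max, J=50, iters=100,
                 kappa=0.8, c1=0.5, c2=0.5, seed=0):
    # chi = [p_1..p_K, f_1..f_K]; `cost` evaluates round delay plus barriers.
    # p_max, f_min, f_max are length-K NumPy arrays.
    rng = np.random.default_rng(seed)
    K = len(p_max)
    lo = np.concatenate([np.zeros(K), f_min])
    hi = np.concatenate([p_max, f_max])
    chi = rng.uniform(lo, hi, size=(J, 2 * K))       # initial positions
    vel = np.zeros_like(chi)                         # initial velocities
    best = chi.copy()
    best_cost = np.array([cost(x) for x in chi])
    g = best[best_cost.argmin()].copy()              # swarm best position
    for _ in range(iters):
        r1, r2 = rng.random((2, J, 1))               # fresh randomness, eq. (24)
        vel = kappa * vel + c1 * r1 * (best - chi) + c2 * r2 * (g - chi)
        chi = np.clip(chi + vel, lo, hi)             # eq. (23), kept in the box
        c = np.array([cost(x) for x in chi])
        improved = c < best_cost                     # eq. (25)
        best[improved], best_cost[improved] = chi[improved], c[improved]
        g = best[best_cost.argmin()].copy()          # eq. (26)
    return g[:K], g[K:]                              # (p*, f*)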

C. Device-Modality Priority Metric

The weight parameters \gamma ^{r}_{k,m} in the queueing models allow prioritizing device-modality pairs based on different metrics. We present the following two categories:

Probabilistic selection: One common way of selecting data modalities and devices is to sample them from a pre-determined distribution. However, this approach ignores the fact that different devices may have dissimilar data distributions for a specific modality and hence contribute differently to the training process. In addition, cross-modal heterogeneity makes particular modalities more important in the training process.

Gradient-based Selection: Gradients have been widely used in the literature for device selection in unimodal FL. However, comparing gradients across different modalities is challenging, as the associated sub-models might have different numbers of parameters. In this case, we introduce the following gradient-based metrics for device-modality selection in MFL:

1) Absolute of Gradients (AbsG)

The absolute value of the gradients measures the significance of the local updates of a user. In this case, the importance weights are formulated as \gamma ^{r}_{k,m} = |\boldsymbol {g}^{r}_{k,m}| . The main issue with AbsG is that different sub-models have unequal numbers of parameters, so one modality typically dominates the selection.

2) Normalized Absolute of Gradient (NAbsG)

To adapt AbsG to the multi-modal case, we define the Normalized Absolute of Gradient (NAbsG) as a metric to select device-modality pairs as follows:\begin{equation*} \gamma _{k,m}^{r} = \frac {|\boldsymbol {g}_{k,m}^{r}|}{n_{m}}, \tag {28}\end{equation*}

where n_{m} is the total number of parameters in sub-model m.
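Both metrics are straightforward to compute from the flattened local gradients; in the sketch below, |g| is aggregated as the sum of absolute gradient entries, which is one possible reading of the notation.

import numpy as np

def priority(grads, n_params, normalized=True):
    # grads[(k, m)]: flattened local gradient of sub-model m at user k;
    # n_params[m]: parameter count n_m of sub-model m.
    # AbsG: gamma = |g|; NAbsG (eq. 28): gamma = |g| / n_m.
    return {(k, m): np.abs(g).sum() / (n_params[m] if normalized else 1.0)
            for (k, m), g in grads.items()}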

SECTION V.

Simulation Results

In this section, we evaluate the performance of the proposed MFL framework as described in subsection III-A.1. Using HAR as an illustrative application, we first outline the setup of the multi-modal dataset, its distribution across clients, and the communication system configuration shown in Fig. 1. We then introduce performance metrics to analyze the framework from the perspectives of multi-modal machine learning and communications. Finally, we present the evaluation results and discussions, demonstrating the improvements in model performance and training latency achieved with the proposed late-fusion model and decision-making framework for MFL. Specifically, under full client participation, we validate the benefits of the late-fusion model for MFL and demonstrate its robustness to missing modalities during inference, outperforming alternative fusion methods. Furthermore, we show the effectiveness of the proposed device-modality selection in enhancing joint accuracy of sub-models while reducing latency when only a subset of device-modality pairs are selected. Finally, we highlight the effectiveness of CF-mMIMO networks for MFL and evaluate the proposed resource allocation scheme in further reducing latency within resource-limited systems.

A. Simulation Setup

1) Dataset Setup

We use the HARWE dataset [35], which consists of K = 20 users performing 9 activity classes. The total number of modalities is M=3 , captured by three different devices, namely a smartphone, a smartwatch, and a smart speaker. The smartphone and smartwatch capture inertial data of the users, while the smart speaker records the audio of its environment. Note that each user may have a subset of these modalities, subject to the quality of the collected data. The local dataset of each user is divided into training, validation, and test sets with proportions of 80%, 5%, and 15%, respectively. For each modality, we use two CNN layers with 64 and 32 filters of size 3 \times 3 and 2 \times 2 max-pooling, followed by two fully-connected layers with 64 and 32 nodes.
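For concreteness, one per-modality sub-model with the stated layer sizes might look as follows; PyTorch, the single-channel 2-D input shape, the absence of padding, and the final 9-class output layer are assumptions not fixed by the description above.

import torch
import torch.nn as nn

class ModalitySubModel(nn.Module):
    # Encoder: two CNN layers (64 and 32 filters, 3x3) with 2x2 max-pooling;
    # classifier: two fully-connected layers (64 and 32 nodes) plus a class head.
    def __init__(self, in_ch=1, n_classes=9):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten())
        self.classifier = nn.Sequential(
            nn.LazyLinear(64), nn.ReLU(),     # infers the flattened size
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes))         # logits o^(m)

    def forward(self, x):
        return self.classifier(self.encoder(x))

# e.g., a 32x32 single-channel input (the shape is a placeholder):
logits = ModalitySubModel()(torch.randn(1, 1, 32, 32))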

2) Communication Network

We consider a CF-mMIMO network (see Fig. 1), where P = 100 APs, each with N = 4 transmitting antennas, are randomly distributed over a 400 \times 300 area. The maximum transmit power in the uplink transmission mode and the maximum energy consumption in each communication round are set to p^{\max }_{k} = 20 \; \text {dBm} and e_{k}^{\max } = 8 \; \text {J}, \; \forall k \in \mathbb {U} , respectively. We set the total uplink bandwidth to B = 20 MHz. In addition, the noise power of the received signals at the APs for each user is {\mathcal {N}}_{k} = -94 \; \text {dBm} . The lower and upper bounds for the local processing frequency of the smart devices are set to f_{k}^{\min } = 1.5 and f_{k}^{\max } = 2 GHz, respectively. We consider J = 200 particles in the PSO algorithm, and the acceleration and inertia coefficients are set to c_{1} = c_{2} = 0.5 and \kappa = 0.8 , respectively.

B. Evaluation Metrics

To assess the performance of the proposed MFL framework, we consider the following performance indices.

1) Classification Metrics

The classification test accuracy of the global model, denoted by A, is defined as the ratio of correctly labeled data samples to the total number of samples in the test set, formulated as A = \mathbb {E}\left \{{{\frac {n_{\text {Corr}}}{n_{\text {Total}}}}}\right \} . Denoting A_{m}^{r} as the global accuracy of sub-model m at round r, to reflect the effect of all available data modalities on the global model, we define the following evaluation metric\begin{equation*} I_{r}^{\text {ML}} = \bar {A}^{r} \exp \left ({{ -\frac {1}{M}\sum _{m=1}^{M}{(A_{m}^{r} - \bar {A}^{r})^{2}} }}\right ), \tag {29}\end{equation*}

where \bar {A}^{r} = \frac {1}{M}\sum _{m=1}^{M} A_{m}^{r} is the average test accuracy of all sub-models at the end of round r. Note that the above metric increases when the average accuracy of sub-models rises. In addition, it decreases when the difference between the accuracy of sub-models increases.

2) Latency

The latency of each communication round is defined in (11), with the overall latency given by (10). Additionally, we define the average delay desynchronization for round r as follows:\begin{equation*} I_{r}^{\text {desync}} = \frac {1}{K_{s}^{r}} \sum _{k\in \mathbb {U}_{s}^{r}} \big (T_{r}(\boldsymbol {s}_{r},\boldsymbol {a}_{r}) - t_{k}^{r}\big ). \tag {30}\end{equation*}

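Both evaluation metrics follow directly from the per-sub-model accuracies and per-user delays; a minimal sketch:

import numpy as np

def joint_accuracy_metric(acc):
    # Eq. (29): I^ML = mean(acc) * exp(-mean((acc - mean(acc))^2))
    a = np.asarray(acc, dtype=float)
    return a.mean() * np.exp(-((a - a.mean()) ** 2).mean())

def desync_metric(delays):
    # Eq. (30): average gap between the round time (slowest user) and each
    # selected user's delay.
    t = np.asarray(delays, dtype=float)
    return (t.max() - t).mean()

print(joint_accuracy_metric([0.8, 0.7, 0.9]))  # 0.8 * exp(-1/150)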

C. Evaluation Results

We present our experimental results in three parts to evaluate the performance of i) the proposed late-fusion structure, compared with early-fusion [17] and intermediate-fusion [18] alternatives, ii) the proposed device-modality selection scheme in comparison with existing approaches from the literature [28], [36], [37], [38], [39], and iii) MFL over CF-mMIMO networks, particularly with the proposed PSO power allocation technique.

1) Multi-Modal Data Fusion in FL

Each multi-modal user needs to construct a unified local model that matches its local dataset based on the available data modalities. The updated models are uploaded to the CU for aggregation in each communication round. We evaluate the performance of our proposed late-fusion (LF) multi-modal structure (Fig. 2) against baseline models in terms of test accuracy within the MFL framework. Additionally, we analyze its ability to address the challenge of missing modalities during the inference phase compared to alternative approaches. Finally, the computational requirements of the proposed model are compared with those of the baselines. Accordingly, we consider two types of fusion models widely used in the literature, namely, early fusion (EF) [17] and intermediate fusion (IF) [18]. In the early-fusion model structure, raw features of all data modalities are combined before being fed into the neural network. In contrast, intermediate fusion combines extracted features of all modalities and uses them as the input of a joint classifier. Although the proposed late-fusion structure is compatible with scenarios with ID-space and feature-space heterogeneity, the EF and IF models require a) each local dataset to have aligned multi-modal samples, i.e., Homo-MFL, and b) all local datasets to have the same data modalities. Accordingly, to allow a comparison between LF and the two alternatives, we consider Homo-MFL in this part of the experiments. In addition, for the sake of a fair comparison, we consider a similar neural network structure for all three fusion models. We compare the performance of the three fusion models during both training and inference within the context of MFL.

Fig. 3 illustrates test accuracy versus communication round for the three fusion types when all users are selected in each communication round. As can be seen, both the early-fusion and intermediate-fusion structures slightly outperform the late-fusion model in terms of accuracy; in particular, early and intermediate fusion capture the interconnected structure of the data modalities, resulting in better performance. Table 2 compares the number of Floating-Point Operations (FLOPs) and the upload size of each model. The proposed late-fusion structure has the lowest FLOPs count, demanding the least computation, while intermediate fusion yields the smallest model size among the three, resulting in less latency on average. However, the main advantage of the late-fusion structure is the degree of freedom in selecting device-modality pairs, which allows (a) flexible participation of modality-specific sub-models, and (b) robustness to missing modalities. This important property makes the late-fusion model an ideal structure for Hetero-MFL scenarios with ID-space and feature-space data heterogeneity.

TABLE 2. Comparison between FLOPs and size of different fusion models.

FIGURE 3. Accuracy versus communication rounds for different data fusion models in MFL.

To evaluate robustness to missing data, we consider the scenario where one or two of the data modalities are missing in the inference phase. For the early-fusion and intermediate-fusion baselines, reconstruction networks are trained locally to recover the missing data. The proposed late-fusion structure, in contrast, does not require a reconstruction network, as each sub-model can make an independent decision about the activity class label. The bar charts in Fig. 4(a) and (b) compare the test accuracy of the three models when one and two data modalities are missing, respectively. As can be seen, the late-fusion structure is more robust to missing data, while the accuracy of the early-fusion and intermediate-fusion models drops noticeably. In effect, the averaging function \Psi in the late-fusion model disregards the missing modality by redistributing the averaging weights over the available modalities, as sketched below. This results in an average test accuracy improvement of 15% and 23% for the proposed late-fusion model, compared to the best-performing alternative, when one and two modalities are missing, respectively.
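A hedged sketch of this masking step (equal weights over the available modalities are an assumption; function names are illustrative):

```python
import torch

def psi_masked_average(logit_list, available):
    """Averaging with missing modalities: weights are renormalized over
    the modalities actually present at inference time."""
    kept = [logits for logits, ok in zip(logit_list, available) if ok]
    return torch.stack(kept).mean(dim=0)

# Example: phone modality missing at inference time
logits = [torch.randn(1, 6) for _ in range(3)]  # watch, phone, speaker sub-models
print(psi_masked_average(logits, available=[True, False, True]))
```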

FIGURE 4. Comparison of different fusion models with a) one, and b) two missing modalities in the inference phase. W: watch modality, P: phone modality, S: speaker modality.

2) Multi-Modal Device Selection

At the start of each communication round, i.e., Stage (S1), the CU decides which device-modality pairs participate in the current round based on Algorithm 1. Given the limitations of the communication network, a desirable selection scheme picks the device-modality pairs that contribute most to the MFL process in each communication round while incurring less delay. Accordingly, an intended HAR model maximizes the average accuracy of the uni-modal sub-models and minimizes the gap between sub-models corresponding to different data modalities, resulting in an increased joint accuracy as formulated in (29). In light of this, we compare our proposed MultiQ device-modality selection scheme with five baselines: i) Full Selection (FS), where all devices and modalities participate in the MFL process [24], [36], [37], [38]; ii) Client Selection (CS), in which the whole multi-modal dataset of a selected client (i.e., user) is employed to update the local model in each round [23], [25], [39]; iii) the Modality-Aware (MA) scheme [28], where a combination of the size and effect of sub-models is employed as the selection metric; iv) random selection, where devices and modalities are selected randomly in each round; and v) SingleQ, where device-modality pairs are prioritized based on the AbsG and NAbsG metrics. Note that, except in the FS scheme, only half of the device-modality pairs are selected at each communication round. A minimal sketch of a MultiQ-style selection loop is given below.
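This is a hedged sketch of one plausible reading of a multi-queue selection loop, not the paper's Algorithm 1: one priority queue per modality, pairs ranked by a normalized gradient score (NAbsG-like), and queues served round-robin so that no single modality dominates the selection:

```python
import heapq

def multiq_select(pairs, budget):
    """pairs: dict mapping modality -> list of (device_id, score),
    where score is an assumed normalized gradient magnitude.
    Returns up to `budget` (device, modality) pairs."""
    queues = {m: [(-score, dev) for dev, score in lst] for m, lst in pairs.items()}
    for q in queues.values():
        heapq.heapify(q)                        # max-priority via negated scores
    selected, mods = [], list(queues)
    while len(selected) < budget and any(queues.values()):
        for m in mods:                          # round-robin across modality queues
            if queues[m] and len(selected) < budget:
                _, dev = heapq.heappop(queues[m])
                selected.append((dev, m))
    return selected

# Example: pick half of six device-modality pairs
pairs = {"watch": [(0, 0.9), (1, 0.4)],
         "phone": [(0, 0.7), (2, 0.3)],
         "speaker": [(1, 0.8), (2, 0.2)]}
print(multiq_select(pairs, budget=3))
```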

Fig. 5(a) and (b) compare the aforementioned selection schemes in terms of accuracy and latency over a resource-limited wireless network. As can be seen, the proposed MultiQ-AbsG method yields only slightly lower accuracy than the FS scheme, despite selecting only half of the possible device-modality pairs. It achieves higher joint accuracy than the CS and MA schemes, as MultiQ-AbsG prioritizes the device-modality pairs with the greatest impact on the model. In contrast, the CS scheme selects all modalities of a chosen client, often including device-modality pairs with minimal contribution. Similarly, the MA scheme tends to favor smaller models, which typically have less impact on the overall fusion model, resulting in an increased performance gap among the sub-models. The SingleQ-AbsG scheme produces the lowest test accuracy, since modalities with larger sub-models possess larger gradients and hence have a higher chance of being selected over those with smaller sub-models. The performance of this scheme is enhanced by the NAbsG selection metric, where the gradient magnitudes are normalized. Regarding latency, the FS scheme incurs the largest delay due to the high number of participants, while the CS and random selection methods experience less delay since a uniform portion of modalities is selected. Furthermore, the SingleQ-AbsG and SingleQ-NAbsG methods exhibit the least delay, since they inherently prioritize uploading the smallest gradients to the server. The MA algorithm achieves slightly lower delay than our proposed MultiQ-AbsG scheme, as it tends to select smaller sub-models for transmission.

FIGURE 5. Comparison of selection schemes in single-class and multi-class queue models.

Similar to the unimodal case, statistical data heterogeneity can significantly affect the performance of the trained HAR model by producing biased models. This challenge becomes particularly pronounced when local datasets exhibit an imbalanced distribution, which can degrade overall model performance. We therefore consider three major cases of data heterogeneity to illustrate the effect of device-modality selection on model performance: i) non-IID heterogeneity, where users possess non-IID data with the same number of samples, e.g., perform activities at different paces; ii) label heterogeneity (LabelHet), where users possess different activity classes; and iii) sample heterogeneity (SampleHet), where users have uneven numbers of samples. For a fair comparison, we use an equal number of total data samples in the network for all the aforementioned scenarios; a sketch of how such partitions can be generated is given below.
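A hedged sketch of illustrative generators for the last two heterogeneity types; the partitioning protocols (per-user class subsets, Dirichlet-distributed sample counts) are common conventions assumed here, not necessarily the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def label_het_split(labels, n_users, classes_per_user):
    """LabelHet: each user only observes a subset of activity classes."""
    classes = np.unique(labels)
    splits = []
    for _ in range(n_users):
        own = rng.choice(classes, size=classes_per_user, replace=False)
        idx = np.flatnonzero(np.isin(labels, own))
        splits.append(rng.permutation(idx))
    return splits

def sample_het_sizes(n_users, total, alpha=0.5):
    """SampleHet: uneven per-user sample counts drawn from a Dirichlet
    distribution, keeping the network-wide total fixed."""
    props = rng.dirichlet(alpha * np.ones(n_users))
    return np.maximum(1, (props * total).astype(int))

print(sample_het_sizes(n_users=5, total=1000))
```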

To this end, Fig. 6 shows the accuracy of MFL with the late-fusion structure under full participation of users (i.e., N^{\text{QoL}} = 20) and partial participation (i.e., N^{\text{QoL}} = 3, 6, 10, 15). As shown in this figure, data heterogeneity in the local datasets results in fluctuations in the MFL learning curve, particularly when only a small portion of users is selected. Increasing the number of participants reduces these fluctuations, yielding a smoother learning curve. However, a larger number of participants can increase the corresponding delay at the local devices in resource-constrained communication networks. We discuss this issue for CF-mMIMO networks in the following subsection.

FIGURE 6. Comparison of test accuracy versus communication round for different numbers of selected device-modality pairs (full selection: N^{\text{QoL}} = 20; partial selection: N^{\text{QoL}} = 3, 6, 10, 15), where local multi-modal datasets exhibit different types of statistical heterogeneity: a) non-IID, b) LabelHet, and c) SampleHet.

3) CF-mMIMO Resource Allocation

Overall, system heterogeneity is the main source of communication desynchronization, and hence of significant delay, in MFL. In particular, latency is affected by three main variables: the number of selected clients, the selected device-modality pairs, and the availability and optimal allocation of communication-computation resources for the selected device-modality set. We conduct experiments comparing the performance of the CF-mMIMO network and the proposed resource allocation schemes with conventional MIMO networks in MFL. Replacing the single base station of conventional MIMO systems with multiple APs in CF-mMIMO networks reduces the disparity between the channel conditions of devices across the network and can thereby reduce MFL latency. Fig. 7 illustrates the average communication channel gain across the network for different numbers of APs, where an equal total number of transmitting antennas (i.e., N \times P = 64) is employed for a fair comparison. As can be observed from Fig. 7(a) and (b), CF-mMIMO networks provide a more uniform channel gain across the network than a centralized MIMO network. In addition, Fig. 7(c) shows that while the average channel gain across the network remains consistent as the number of APs (P) increases, the variance of the channel gain, i.e., the variation among the channel quality of devices, is reduced. The sketch below illustrates this trend under a simple path-loss model.
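A hedged simulation sketch of the variance-reduction effect; the AP grid layout, path-loss constants, and exponent are illustrative assumptions rather than the paper's channel parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def channel_gain_stats(n_aps, n_users=100, area=1000.0, total_antennas=64):
    """Mean and variance of users' aggregate large-scale channel gain when
    a fixed antenna budget is spread over n_aps access points on a grid."""
    side = int(np.sqrt(n_aps))                  # n_aps assumed a perfect square
    xs = (np.arange(side) + 0.5) * area / side
    aps = np.array([(x, y) for x in xs for y in xs])
    users = rng.uniform(0, area, size=(n_users, 2))
    dist = np.linalg.norm(users[:, None, :] - aps[None, :, :], axis=-1)
    gain_db = -30.5 - 36.7 * np.log10(np.maximum(dist, 1.0))  # log-distance model
    ant_per_ap = total_antennas // n_aps
    user_gain = (ant_per_ap * 10 ** (gain_db / 10)).sum(axis=1)
    return user_gain.mean(), user_gain.var()

# Variance across users shrinks as antennas are distributed over more APs
for p in (1, 4, 16, 64):
    print(p, channel_gain_stats(p))
```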

FIGURE 7. Comparison of channel quality between conventional MIMO and CF-mMIMO networks: a) centralized MIMO, b) CF-mMIMO with four APs, and c) boxplot of the mean and variance of the average channel gain across the network versus the number of APs. For a fair comparison, the total number of transmitting antennas in the network is set to N \times P = 64 for all cases.

At Stage (S1) of each communication round, our proposed PSO scheme optimizes the communication and computation resources in problem {\mathcal{P}}1 to further reduce the latency. Accordingly, to observe the effect of the uniform channel quality of CF-mMIMO networks on MFL latency, Fig. 8(a) and (b) show communication desynchronization and latency versus the number of APs, respectively. We compare our proposed PSO algorithm with the baseline case where users transmit at maximum power in the uplink. In this experiment, a fixed total number of antennas P \times N = 64 is considered, with the number of selected clients set to N^{\text{QoL}} = 10, 15, 30. Accordingly, P = 1 indicates the centralized MIMO setting with 64 transmitting antennas, and P = 64 is the CF-mMIMO setting with 64 APs and one transmitting antenna per AP. The simulation results show that distributing the centralized antennas across the network (i.e., increasing the number of APs) reduces both communication desynchronization and latency. Additionally, a larger number of participants is associated with higher communication desynchronization and delay in MFL. Our proposed PSO algorithm reduces desynchronization and latency by an average of 19% and 23.3%, respectively. A minimal sketch of such a PSO loop is given below.
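This is a hedged sketch of a standard PSO applied to uplink power allocation, minimizing the latency of the slowest user; the rate model `rate_fn`, the unit payload, and the inertia/acceleration coefficients are assumptions standing in for the paper's SINR-based formulation:

```python
import numpy as np

rng = np.random.default_rng(2)

def pso_power_allocation(rate_fn, n_users, p_max, iters=100, swarm=30):
    """Minimize max_k(1 / rate_k(p)) over power vectors p in [0, p_max]^n."""
    pos = rng.uniform(0, p_max, (swarm, n_users))   # candidate power vectors
    vel = np.zeros_like(pos)

    def latency(p):
        return np.max(1.0 / np.maximum(rate_fn(p), 1e-9))  # slowest user dominates

    pbest = pos.copy()
    pbest_val = np.array([latency(p) for p in pos])
    g = pbest[pbest_val.argmin()].copy()            # global best particle
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (g - pos)
        pos = np.clip(pos + vel, 0.0, p_max)        # respect the power budget
        vals = np.array([latency(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

# Toy usage: per-user rates grow with sqrt(power) under assumed channel gains
gains = rng.uniform(0.5, 2.0, size=8)
p_opt, t_min = pso_power_allocation(lambda p: gains * np.sqrt(p + 1e-6),
                                    n_users=8, p_max=1.0)
print(t_min)
```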

FIGURE 8. Comparison of (a) communication desynchronization and (b) latency versus the number of APs between the proposed PSO-based resource allocation algorithm and the maximum-power transmission baseline. In both figures, the total number of antennas P \times N = 64 is fixed.

To further validate the advantages of CF-mMIMO and our resource allocation algorithm for MFL, we examine the number of participants involved in the MFL process under a fixed communication round deadline t^{\text{th}}. In this scenario, as outlined in Algorithm 1, device-modality pairs are added to the MFL process until no further pairs can be included without exceeding the resource constraints; a sketch of this admission loop is given below. Fig. 9 illustrates the number of device-modality participants as a function of the number of APs, comparing maximum transmission power with our proposed PSO-based resource allocation. As shown in the figure, conventional centralized MIMO systems (P = 1) admit the smallest number of participants. As the number of APs increases, the average number of selected participants grows, demonstrating the scalability and efficiency of CF-mMIMO networks in accommodating more participants in the MFL process. The increase in the number of device-modality participants enhances model performance by leveraging a more diverse and comprehensive data representation across modalities.
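A hedged sketch of the deadline-constrained admission loop described above; `latency_of` is an assumed estimator of the round latency for a candidate selection, and candidates are assumed to be pre-sorted by the selection priority:

```python
def admit_under_deadline(candidates, latency_of, t_th):
    """Greedily admit device-modality pairs while the projected round
    latency stays within the deadline t_th; skip pairs that do not fit."""
    selected = []
    for pair in candidates:                 # candidates in priority order
        if latency_of(selected + [pair]) <= t_th:
            selected.append(pair)
    return selected
```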

FIGURE 9. Average number of participants in the MFL framework versus the number of APs, for the proposed PSO algorithm and for the case without resource allocation.

SECTION VI.

Conclusion

Our proposed framework integrates a late-fusion strategy with a device-modality selection method and a resource allocation scheme based on a modified PSO algorithm, effectively addressing the data and system heterogeneity challenges of MFL over CF-mMIMO networks. By employing late fusion, our approach ensures flexible and robust model performance even with missing modalities, which is crucial for HAR applications. The experimental results demonstrate that our fusion model outperforms baseline fusion methods, achieving 15% and 23% improvements in test accuracy when one and two data modalities are missing in the inference phase, respectively. Additionally, the proposed device-modality selection and resource allocation scheme effectively minimizes the disparity between modality-specific sub-models and reduces communication delays in each round. Our results demonstrate the superiority of CF-mMIMO networks over conventional systems in addressing system heterogeneity, achieving reduced completion times for the MFL process. Future work will explore techniques to mitigate modality-specific data heterogeneity in the late-fusion model, and further optimize resource allocation to improve scalability and efficiency in diverse application scenarios.
