Introduction
Large-scale machine learning models deployed in real-world scenarios often require sharing training data with a centralized server, where the actual model optimization takes place. Federated Learning (FL) was initially introduced in [155] as an alternative approach to train a global model across multiple devices while preserving the privacy and decentralization of their respective data.
In particular, machine learning methods for computer vision heavily rely on collecting and storing a huge amount of annotated image data on a central server. Centralizing such data necessitates the transfer of a significant volume of information, resulting in substantial communication overhead. Moreover, centralized data storage poses risks to user privacy and confidentiality, and recent regulations on data privacy prohibit the uploading of sensitive local data to centralized data centers [214].
Federated learning has emerged as a promising solution to address these challenges, as it enables on-device training of visual models. In FL, data remains localized on individual devices, and the collaborative training process involves exchanging model parameters instead of raw data. This approach opens up practical applications and opportunities for privacy preservation and effective management of sensitive data, such as medical images or facial pictures [103], [127], [258]. The fundamental framework, initially introduced by McMahan et al. [155] and depicted in Fig. 1, has been extended in various directions and explored in many works (refer to Section III for more details). Most studies on federated learning focus on its theoretical and communication aspects [2], [105], [128], [145]. Nevertheless, Federated Learning has recently attracted a wide interest when applied to computer vision tasks, ranging from image classification [26], [88], [269] to semantic segmentation [24], [62], [161], and object detection [96], [147], [200].
Standard federated learning setting for vision applications, where devices are heterogeneous in terms of computational capacity, number of image samples, and statistical distribution of data.
As already pointed out, real-world computer vision settings typically deal with huge amounts of data and often with critical privacy issues, and the distributed and privacy-preserving nature of FL makes it an extremely good candidate to solve these problems. For example, FL has sparked large interest in human-centric tasks like face recognition [3], [157] and medical imaging [18], [129], [192], [236], [238], where privacy is a key requirement.
After discussing the challenges that arise in Section II, we overview the basic FL theoretical frameworks in Section III, while the different settings for FL in computer vision are detailed in Section IV. The main challenges can be tackled using different techniques, e.g., knowledge distillation, representation and prototype learning, and different aggregation strategies. Section V introduces the main insights and ideas that allow vision tasks to be tackled efficiently in a federated learning environment. This section empowers readers to focus on the methodologies that align with their interests and encourages further exploration of specific works in the subsequent sections. Many approaches apply these ideas to well-known computer vision problems like image classification, object detection, semantic segmentation and face recognition, and FL is also widely used in the medical imaging field. The most relevant approaches are presented in Section VI, while some benchmark comparisons can be found in Section VII. In Section VIII, we provide an overview of current and future trends in Federated Learning, and finally in Section IX we draw the conclusions.
Main Challenges in FL
Compared to standard supervised learning, the Federated setting introduces new challenges: the clients usually have different hardware capabilities (system heterogeneity), a different amount of samples to process (data imbalance), and their data statistics may be different (statistical heterogeneity). Furthermore, the system should be efficient in terms of communication. Finally, any proper FL algorithm must not break the privacy preservation of clients’ data.
Summarizing, the main challenges are:
Statistical heterogeneity: clients’ data is highly non-IID, i.e., their statistics may not be indicative of the global distribution as they reflect the specific clients’ usage.
Model heterogeneity: clients can have different models.
Communication cost: transmitting model weights from clients to the server introduces latency in training, and sharing a huge quantity of information can overload the network [74]. The communication bottleneck can be further exacerbated by the devices’ limited or intermittent connectivity due to battery consumption constraints, faults, or data unavailability.
Convergence time: reducing the time required to complete the training is a key target.
Privacy and Security: client-server communications should not contain sensitive information and FL systems should prevent the server from accessing clients’ local data [21], [127].
Catastrophic forgetting: inconsistent predictions can arise between subsequent training rounds [117].
Unlabeled data at clients: in some settings assuming that clients have access to ground truth data is not realistic.
Overview of Federated Learning
The first approach to FL, which also represents the baseline algorithm, is Federated Averaging (FedAvg) [155]. FedAvg uses a client-server architecture to perform collaborative learning in synchronous rounds. The server (or aggregator) broadcasts the global model’s current parameters to some of the clients at the start of each round.
Each participant locally trains the model on its own private data and sends the updated model parameters back to the server. The server collects these updates and combines them using a specified strategy, i.e., a weighted average based on the amount of local data each participant holds. The combined updates are applied to the global model as a “pseudo-gradient” [180]. This process can be repeated for multiple rounds of FL by distributing the updated global model to participants. A summary of the pipeline is shown in Fig. 2.
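To make the aggregation step concrete, the following minimal sketch shows FedAvg-style weighted averaging of client models, assuming PyTorch state dicts; the function name and interface are illustrative rather than taken from any specific library.

```python
import copy

import torch

def fedavg_aggregate(global_model, client_states, client_sizes):
    """Weighted average of client models, in the spirit of FedAvg [155].

    client_states: list of state_dicts trained locally by the clients
    client_sizes:  number of local samples per client, used as weights
    """
    total = float(sum(client_sizes))
    new_state = copy.deepcopy(client_states[0])
    for key in new_state:
        # skip integer buffers (e.g., BN batch counters) for simplicity
        if not torch.is_floating_point(new_state[key]):
            continue
        new_state[key] = sum(
            (n / total) * s[key] for s, n in zip(client_states, client_sizes)
        )
    global_model.load_state_dict(new_state)
    return global_model
```

Equivalently, the server can compute the difference between the averaged model and the current global model and feed it as a “pseudo-gradient” to a server-side optimizer [180].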
Despite showing solid empirical results in IID and balanced settings, FedAvg’s performance degrades when this assumption is dropped [269]. Subsequent research works took into consideration that heterogeneous data characterizes real-world federated learning and simulated this scenario through realistic per-user data splits [88], [269]. To this extent, they address two types of distribution shift: non-identical class distribution, where the visual distribution of classes differs by device, and imbalanced client data sizes, where the number of samples available for training varies for each client.
FedProx [128] can be viewed as a generalization and re-parametrization of FedAvg. Theoretically, FedProx provides convergence guarantees when learning over data from non-identical distributions (statistical heterogeneity), while adhering to device-level system constraints by allowing each participating device to perform a variable amount of work (system heterogeneity). Different devices in federated networks often have different resource constraints in terms of computing hardware, network connection, and battery level. Therefore, it is unrealistic to force each device to perform a uniform amount of work (i.e., running the same number of local epochs), as in FedAvg. FedProx allows variable amounts of work to be performed locally across devices based on their available system resources, and then aggregates the partial solutions sent by the stragglers (rather than discarding their updates). Moreover, a proximal term is added to the local subproblem to effectively limit the impact of local variable updates.
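For illustration, the proximal term can be added to the client’s objective as in the sketch below (PyTorch-style, with our own naming); mu is the FedProx hyper-parameter controlling how strongly local updates are anchored to the global model.

```python
import torch

def fedprox_loss(task_loss, local_model, global_params, mu=0.01):
    """Sketch of the FedProx local objective: the task loss plus a
    proximal term (mu/2) * ||w - w_global||^2 that keeps the local
    update close to the global model received at the round start."""
    prox = sum(
        torch.sum((w - w_g.detach()) ** 2)
        for w, w_g in zip(local_model.parameters(), global_params)
    )
    return task_loss + 0.5 * mu * prox
```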
SCAFFOLD [105] aims for faster convergence and for a reduction of the so-called “client-drift” in local updates. SCAFFOLD estimates the update direction both for the server model and for each client; the difference between the two is then an estimate of the client drift, which is used to correct the local update. It can be seen as an improved version of [190], introduced for distributed parallel optimization.
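Conceptually, the drift-corrected local step can be sketched as follows (names are ours; c_server and c_client are the control variates estimating the global and local update directions, and their end-of-round updates are omitted).

```python
import torch

@torch.no_grad()
def scaffold_local_step(params, c_server, c_client, lr):
    """One SCAFFOLD-style corrected SGD step:
    w <- w - lr * (grad + c_server - c_client).
    Assumes loss.backward() has already populated the .grad fields."""
    for w, c, c_i in zip(params, c_server, c_client):
        w -= lr * (w.grad + c - c_i)
```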
MIME [104] extends SCAFFOLD to all types of functions and applies global momentum locally, since this proved to be more effective than server-only momentum strategies. While MIME has demonstrated good performance, it can be detrimental to training efficiency as it requires computing the gradient twice at each local step [220].
AdaBest [212] proposes an adaptive algorithm that estimates the drift across clients using less storage and communication bandwidth, as well as lower compute costs. Additionally, it improves stability by constraining the norm of the estimated client drifts, making it more practical for large-scale FL.
FedNova [219] provides a general framework to analyze the convergence of heterogeneous federated optimization algorithms. The authors focus on understanding the solution bias and the convergence slowdown caused by objective inconsistency. Moreover, FedNova presents a normalized averaging method that aims to eliminate objective inconsistency while preserving fast error convergence. They argue that sophisticated approaches such as FedProx [128] and SCAFFOLD [105], designed to handle non-IID local datasets, can reduce (but not eliminate) objective inconsistency to some extent, yet either result in slower convergence or require additional communication and memory resources. In FedNova, the locally normalized updates (which are just re-scaled versions of the cumulative local changes) are averaged instead of the local changes themselves.
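A hedged sketch of the idea: each client’s cumulative change is normalized by its number of local steps before averaging, so clients performing more local work do not bias the global objective (interface and names are ours, and the plain local-SGD case is assumed).

```python
import torch

@torch.no_grad()
def fednova_aggregate(global_params, client_deltas, client_steps, client_weights):
    """Normalized averaging in the spirit of FedNova.

    client_deltas:  per-client cumulative change (w_global - w_local)
    client_steps:   number of local steps tau_i performed by each client
    client_weights: relative data sizes p_i, assumed to sum to 1
    """
    tau_eff = sum(p * tau for p, tau in zip(client_weights, client_steps))
    for j, w in enumerate(global_params):
        d = sum(
            p * (delta[j] / tau)
            for delta, tau, p in zip(client_deltas, client_steps, client_weights)
        )
        w -= tau_eff * d
```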
The insights introduced by these works have then been applied in many papers targeting federated learning for different vision tasks, ranging from image classification to object detection and semantic segmentation, and again to more human-centric tasks such as face recognition and medical imaging. The various FL approaches for vision applications are detailed in Section VI.
Federated Learning Settings
This section overviews the main FL settings of interest in computer vision.
Standard FL: the standard setting of FL with non-IID data scattered across different clients [269], with the objective of training a single global model.
Heterogeneous FL: clients can have different models, different computation resources and different communication capabilities [48].
Personalized FL: in standard FL client data are used to train collaboratively a global model. In personalized FL instead, each client aims at optimizing a specific model to be deployed on its own data [76].
Clustered FL: in this setting subsets of clients share some common characteristics, e.g., their data belong to the same subdomain, and they can be grouped with the objective of building a personalized model for the cluster that works well in the corresponding subdomain [25], [60], [64], [69], [188], [193].
Continual FL: Continual learning enables a model to learn from a never-ending stream of data, without the need to retrain the model from scratch every time new data or new tasks become available. Class-incremental continual learning, allowing clients to learn from non-stationary data and learn new tasks over time, has been explored in the FL setting [50], [51], [90], [149], [247], [252]. An asynchronous federated continual learning (AFCL) setting has been introduced in [194], where the learning of multiple tasks happens with different orderings and in asynchronous time slots.
Federated Domain Adaptation: Domain adaptation (DA) deals with the statistical heterogeneity between a source dataset, used for training, and a target dataset to which the model should be adapted. It aims at transferring the learning performed on source data to the target set. Interesting DA settings are UDA (Unsupervised DA), where no supervision on the target dataset is available, and SFDA (source-free DA), where source data is not available during adaptation.
Within the context of FL, DA has been approached from diverse perspectives to target various applications, e.g., face recognition [276]. Examples include its integration with adversarial learning techniques [175], the consideration of individual clients as distinct target domains, denoted as multi-target DA [245], the use of each client as a source domain with an additional target one [98], and finally the SFDA setting [193].
Methods
The purpose of this section is to present diverse insights and ideas by emphasizing their broad utility within the context of Federated Learning. We aim to offer readers a comprehensive grasp of the prevailing ideas and concepts, laying the foundation for the subsequent task-specific discussion in Section VI. Since FL needs to solve a wide range of challenges in many different settings, a large number of different strategies have been proposed, both for the local training procedures on the clients and for the model aggregation at the server side. We present the main families of strategies and briefly discuss how the different provisions are implemented in FL approaches. Figure 3 presents an overview of these families of strategies, grouping them into client- and server-side techniques.
Overview of the main strategies employed at client and server side for Federated Learning.
Pre-training is a transfer learning technique that has been widely used to reduce training time and improve final accuracy in large-scale deep learning. Even though the standard FL setup [155] does not consider an initial pre-training step at the server side, the nature of FL makes it the perfect candidate for such strategies. Starting from a pre-trained model significantly reduces the impact of data heterogeneity (data is IID during pre-training while clients have non-IID data), thus allowing clients to be trained with more local epochs (and fewer rounds) since the client drift is more limited [170]. This enables the global models learned under different clients’ data conditions to converge to the same loss basin and makes global aggregation more stable.
The first systematic study on pre-training for FL is [34], which uses five different image datasets. It considers two cases: when the server has a pre-trained model or data for pre-training from a real-world dataset (e.g., ImageNet), and when the server has no data, resorting to image generation techniques such as random generative models or fractals [7], [106]. Fractal pre-training is also used in [194].
Another approach is to force clients to jointly learn to fuse the representations generated by multiple fixed pre-trained models rather than training a large-scale model from scratch [204]. Pre-training on a supervised source dataset at the server side can also be used to tackle real-world setups, where clients have unsupervised data and require adaptation [193], [245].
Image Augmentation and Style Transfer techniques have been used to improve domain generalization in FL [35], [62], [142]. Data augmentation techniques can improve generalization, thus mitigating issues due to data heterogeneity on the clients, and in addition improve accuracy on new unseen clients [46]. Furthermore, in clustered FL, clients can be clustered according to their style to improve performance [193].
In a practical FL application, the models trained with local datasets are likely to establish decision rules based on biased attributes (e.g., fur color for animals), which hinders the aggregated model’s ability to learn a suitable representation for classification. Hence, [239] proposes to learn a Bias-Eliminating Augmenter at each client, able to generate bias-conflicting samples, thus reducing the bias in local updates.
Knowledge Distillation (KD) was introduced in [84] for model compression: it allows knowledge to be transferred from a larger network (teacher) to a smaller one (student). It has been widely used in continual learning and recently has been increasingly employed in FL algorithms [162] to reduce catastrophic forgetting, tackle data heterogeneity, and enable model heterogeneity.
At the client side, KD can be used to tackle catastrophic forgetting, by forcing the predictions of the current local model to be consistent with the global model of the previous step, preserving inter-class semantic consistency across different incremental tasks [50], and balancing knowledge from other clients while boosting both inter- and intra-domain performance [90].
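A typical client-side distillation term of this kind can be sketched as follows, with the previous-round global model acting as the teacher; this is a generic formulation, not the exact loss of any of the cited works.

```python
import torch.nn.functional as F

def distill_from_global(student_logits, teacher_logits, T=2.0):
    """Generic KD regularizer: KL divergence between the softened
    predictions of the local (student) and global (teacher) models.
    The T^2 factor keeps gradient magnitudes comparable across
    temperatures, as is customary for distillation losses."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```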
KD methods can address data heterogeneity, both at client and server sides. At the client side, global knowledge is used to control the client drift via on-device regularizers [80], [81], [117], [246] or using synthetically-generated data [275]. On the server side, instead, the global model can be rectified via ensemble distillation of a proxy dataset [32], [138], [187] or using a generator network [215], [262], [263].
Finally, KD can be used without resorting to a public dataset or a generative model at the server, by applying it to averaged data representations and soft predictions (referred to as “hyperknowledge”) to improve both personalized and global model performance [31].
Representation Learning techniques aim at improving a downstream task by enforcing meaningful distribution of the training data representations (features).
A first set of approaches aims at aligning features across clients during local training to address data heterogeneity on clients [271] and domain generalization [261]. Contrastive learning is used in [204] to assist local training and achieve higher model performance [124], [166], [259].
The permutation invariance property of neural networks leads to neuron misalignment across local models, therefore [133] binds neurons in positions and pre-aligns parameters for better coordinate-wise parameter averaging, while [218] matches the neurons of client models before averaging them, and permits global model size adaptation.
A different approach is proposed in [52], where the non-IID issue is tackled by constraining learned representations of data points to be on a unit hypersphere shared by clients.
A feature-oriented model structure adaptation method is exploited in [254] to ensure explicit feature allocation. By applying the structure adaptation to collaborative models, matchable structures with similar feature information can be initialized at the very early training stage. Then, during the federated learning process, a feature-paired averaging scheme is used to guarantee aligned feature distributions and avoid feature fusion conflicts under both IID and non-IID scenarios.
Prototype Learning techniques aim to learn a compact representation of features, called prototypes, which can be used as representatives of the target classes; hence the large adoption of these methods in continual learning. In FL, a first possible strategy is the correction of the client drift by computing client deviations using margins of prototypical representations learned on distributed data. These margins can be exploited to drive the federated optimization, via an attention mechanism [161], to address system and statistical heterogeneity. Prototypes can also be transmitted instead of model weights [203], to reduce the communication cost, to allow clients to learn a more customized local model, and to be more robust to gradient-based attacks [28], [273], since high-level statistical information (prototypes) is more privacy-compliant than raw features. Prototype-based contrastive losses can be used to make local objectives consistent with the global optima and tackle the non-IID data issue [166]. Finally, since global prototypes could be biased towards the dominant domain distribution in the presence of numerous domains (domain shift), a combination of cluster prototypes and averaged prototypes can be employed [91].
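For concreteness, the sketch below shows prototype computation and nearest-prototype classification in their simplest form; the cited works build margins, attention mechanisms, or contrastive terms on top of this basic scheme.

```python
import torch

def class_prototypes(features, labels, num_classes):
    """Per-class prototypes as mean feature vectors; in prototype-based
    FL these compact statistics can be exchanged instead of weights."""
    protos = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(dim=0)
    return protos

def nearest_prototype(features, prototypes):
    # predict the class whose prototype is closest in feature space
    return torch.cdist(features, prototypes).argmin(dim=1)
```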
Batch Normalization (BN) is a key tool in deep learning. However, standard batch normalization is not very effective in FL (according to [54], this is because channel statistics change significantly across clients). Various customized versions of BN layers have been proposed for federated learning, empirically showing better performance.
Firstly, it is possible to update the client batch-norm layers locally without communicating them to the server, in order to achieve personalized FL [132]. Local-statistic batch normalization (BN) layers can be exploited, resulting in collaboratively-trained, yet center-specific models [8]. This strategy improves robustness to data heterogeneity while also reducing the potential for information leaks, since the center-specific layer activation statistics are not shared. Group normalization (GroupNorm) can improve the convergence of FL [86]. Nevertheless, GroupNorm is an instance-based normalization scheme, which is highly sensitive to noise in the data samples. The privacy concern can also be tackled by not tracking running estimates and simply normalizing batch data [48].
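The simplest of these variants, keeping BN layers local as in [132], can be sketched as follows; the name-based key matching is our own heuristic and assumes standard PyTorch layer naming.

```python
def is_bn_entry(key):
    # crude name-based heuristic for BN parameters and buffers
    return any(t in key for t in ("bn", "running_mean", "running_var",
                                  "num_batches_tracked"))

def aggregate_skip_bn(client_states, weights):
    """Average all state-dict entries except batch-norm ones, which each
    client keeps local; clients then load the result with strict=False,
    preserving their own personalized BN statistics."""
    return {
        key: sum(w * s[key] for s, w in zip(client_states, weights))
        for key in client_states[0]
        if not is_bn_entry(key)
    }
```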
Model Aggregation: variations of the standard FedAvg [155] aggregation have been introduced for a variety of reasons. With FairAvg [160] each user contributes equally to the aggregated model (simple mean), increasing accuracy and convergence rate. Layer-wise aggregation schemes are used to enable personalized [153] and clustered FL [193]. In [248] FL is applied to Vehicular Edge Computing, and a selective model aggregation scheme is proposed to reduce the influence of the varying image quality and computation capabilities of the vehicular clients. The work in [29] presents elastic aggregation, a novel approach in federated learning for addressing gradient dissimilarity in heterogeneous scenarios: by leveraging parameter sensitivity, this technique improves convergence behavior and enhances the effectiveness of federated learning. In [264], to enable domain generalization to unseen domains, the global model dynamically calibrates the aggregation weights by minimizing the variance of the generalization gap. Finally, FedFusion [55] introduces a variational autoencoder method for learning the optimal parameters of distribution fusion components based on observed information. These parameters are then utilized to optimize the federated model aggregation in the presence of non-IID data. Notice that, despite the server not having direct access to private data, statistical characteristics are embedded within the received model parameters (normalization layers), from which the server can infer the local distributions.
Privacy and Security are implemented in FL by enabling local training without the need to exchange critical data between the server and the clients. This safeguards the clients’ data from potential eavesdropping by hidden adversaries. However, it is important to note that adversaries may still be able to collect private information by analyzing the differences in trained network weights or other parameters transmitted by the clients [150], [196], [226]. To mitigate the risk of information leakage, a first solution is to perturb the transmitted data in some way, making eavesdropping harder without overly affecting model performance. Techniques for this task include differential privacy (DP) [56], additive perturbation, and multiplicative perturbation [251]. As an example, the authors of [227] propose a framework based on differential privacy in which each client adds noise to its locally trained parameters before uploading them to the server for aggregation. An alternative is to use cryptographic techniques such as homomorphic encryption [183], secret sharing, and secure multi-party computation. For example, in [151] the model updates undergo encryption using an aggregated public key before being shared with the server for aggregation. To decrypt the updates, collaboration among all participating devices is necessary. This approach demonstrates resilience against potential attacks from the participants and collusion attempts between the participating devices and the server. Many other approaches for this task have been proposed; for comprehensive surveys focusing on privacy and security in FL see [165] and [251].
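A minimal sketch of the perturbation idea follows: each client clips its update and adds Gaussian noise before transmission. Achieving a formal (epsilon, delta) differential-privacy guarantee requires calibrating the noise to the clipping norm and the privacy budget, which is omitted here; the constants are placeholders.

```python
import torch

def perturb_update(update, clip_norm=1.0, noise_std=0.1):
    """Clip the client update to a maximum L2 norm, then add Gaussian
    noise to every entry (values are illustrative, not calibrated to a
    differential-privacy budget)."""
    flat_norm = torch.cat([p.reshape(-1) for p in update]).norm()
    scale = min(1.0, clip_norm / (flat_norm.item() + 1e-12))
    return [p * scale + noise_std * torch.randn_like(p) for p in update]
```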
FL in Computer Vision: Tasks and Approaches
FL has shown great potential in computer vision applications, particularly when dealing with large and sensitive datasets, and it has been applied to a variety of computer vision tasks. However, each of these tasks poses unique challenges when it comes to federated learning. For instance, semantic segmentation requires pixel-level annotations, which are usually scarce and expensive to obtain. Additionally, the distribution of the data across different clients may be heterogeneous, which can lead to performance degradation if not addressed appropriately. Despite these challenges, recent works have demonstrated promising results in applying federated learning to computer vision tasks [24], [62], [161], [193], and ongoing research aims at improving the performance and scalability of FL in this domain. An overview of the distribution of the papers on the different tasks is shown in Fig. 4.
A timeline illustrating the increasing research focus on these vision tasks over the years. The plot shows the number of papers per year for each of the 5 tasks in Section VI. Notice that papers tackling face recognition and medical imaging are assigned to these tasks independently of the underlying computer vision task.
Throughout this section, we address the different tasks (i.e., the three main image understanding tasks - Classification, Object Detection, and Semantic Segmentation - plus two widely considered application fields, Face Recognition and Medical Imaging) and present the main approaches developed to tackle each of them. This structure allows readers to focus on specific application domains and facilitates an in-depth exploration of the strategies presented across different research papers. We decided to subdivide the works on the basis of the considered task since, although many FL techniques can in principle be applied to different tasks, the majority of current research works focus on a single vision task. In the remainder of this section, we present the various approaches, while some performance comparisons will be shown in Section VII.
A. Image Classification
Image classification, i.e., classifying an entire image into one of the possible semantic classes, is a fundamental task in computer vision. Several backbone models have been used for federated image classification tasks, from the widely used ResNet [79] model to lightweight architectures. As an example, MobileNet [186] runs efficiently on mobile devices with limited computational power, thus being well suited for real-world FL settings. Simple datasets have been widely used, such as the 10-class color image dataset CIFAR-10 [112], its extended version with 100 classes (CIFAR-100), or the grayscale images of handwritten digits from MNIST [116] and EMNIST [42]. The data inside these datasets needs to be split in order to simulate the non-IID distribution across clients, which is a key characteristic of FL data. To this aim, FL researchers have proposed new data splits to simulate real-world scenarios. As an example, Zhao et al. [269] suggest two partitions of MNIST and CIFAR-10: in the first, each client receives data belonging to a single class only, while in the second, each client is randomly assigned two data shards coming from two classes.
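The second split can be simulated with a simple shard-based partition, sketched below; this is our own illustrative implementation of the idea, not the authors’ code.

```python
import numpy as np

def shard_split(labels, num_clients, shards_per_client=2, seed=0):
    """Sort samples by label, cut them into contiguous shards, and assign
    each client a few shards, so that every client sees only a couple of
    classes (the non-IID split discussed in [269])."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)  # sample indices grouped by class
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))
    return [
        np.concatenate([shards[s] for s in
                        shard_ids[i * shards_per_client:
                                  (i + 1) * shards_per_client]])
        for i in range(num_clients)
    ]
```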
1) Tackling Statistical Heterogeneity
As for most tasks, the impact of non-IID data at the clients is one of the key issues to be tackled by approaches targeting image classification in the federated setting. The work of [269] shows that the accuracy is reduced significantly when data is non-IID, which can be explained by the model weight divergence at each device. As a solution, they propose sharing a small subset of image data across all devices. A simple modification of FedAvg, called FedAvgM [87], addresses the non-IID issue by introducing momentum at the server side. The work of [182] introduces a FL framework robust to affine distribution shifts. They propose a fast and efficient optimization method and provide convergence and performance guarantees via a Gradient Descent Ascent (GDA) method. In [124] the authors address the non-IID issue by observing that the global model trained on the whole dataset learns a better representation than a local model trained on a skewed subset. They propose model-contrastive learning, to force the representations learned by the local and global models to be aligned. A probabilistic FL framework is considered in [257], where the global model is built with a Bayesian non-parametric strategy that allows the local parameters to match existing global ones; new global parameters are created if the existing ones are poor matches. FedAlign [156] studies the data heterogeneity challenge of FL focusing on local learning generality. It resorts to a distillation-based regularization method to align the Lipschitz constants, promoting smooth optimization and consistency within the model. A general multi-stage FL framework (FedCorr) is proposed in [237] to tackle data heterogeneity, with respect to both local label quality and local data statistics. More specifically, an adaptive local proximal regularization term based on estimated local noise levels is introduced. The work of [133] focuses on solving the non-IID challenge in FL by limiting the misalignment across local models. FedPVR [120] starts from the observation that the misalignment between models is stronger in the last few layers of the network than in the rest of it. To address this issue, the authors propose a partial variance reduction that aligns the local models specifically in the last layers (differently from SCAFFOLD, which performs it on all layers).
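The model-contrastive idea of [124] can be sketched as a contrastive term that treats the global model’s representation of a sample as the positive and the previous local model’s as the negative; the simplified single-negative form below is our own condensation.

```python
import torch
import torch.nn.functional as F

def model_contrastive_loss(z_local, z_global, z_prev, tau=0.5):
    """Pull the current local representation towards the global model's
    and push it away from the previous local model's, per sample."""
    pos = F.cosine_similarity(z_local, z_global, dim=-1) / tau
    neg = F.cosine_similarity(z_local, z_prev, dim=-1) / tau
    logits = torch.stack([pos, neg], dim=1)  # the positive should win
    labels = torch.zeros(z_local.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```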
2) Tackling Model Heterogeneity
In real-world scenarios, clients exhibit variations in their computational capabilities, leading to discrepancies in the complexity of their models. Therefore, several studies have been conducted to enable FL in the context of such heterogeneous models. Lin et al. [138] propose an ensemble distillation for model fusion (FedDF). The server collects the class scores of the public dataset on each client model and calculates the average value as the updated consensus.
A scenario with noisy and heterogeneous clients is considered in [61]. As in the previous work, public data is used to deal with model heterogeneity, and a correction loss computes the optimal weighted combination of the selected client outputs, reducing the contribution of the noisy ones.
Diao et al. [48] present HeteroFL, a method to dynamically allocate a subset of global model parameters as local model parameters, considering the capabilities of the local clients. Ilhan et al. [92] introduce ScaleFL, a framework that adjusts the size of a deep neural network based on available resources. It achieves this by allowing the model to make predictions at different stages with early exit points. This enables the selection of the best-fit models for efficient training on distributed client devices. It also incorporates self-distillation, which involves leveraging predictions from different early exits to improve knowledge transfer among subnetworks.
3) Personalized FL
Some works, instead of aiming at the construction of a unique global model, focus on the construction of personalized models optimized for the data of a specific client or set of clients. The work of [101] draws a connection between two widely used FL models and model-agnostic meta-learning (MAML) algorithms, and interprets existing FL algorithms in the light of existing MAML algorithms. They show empirically that FedAvg is already a meta-learning algorithm, optimizing for personalized performance rather than for the quality of the global model; supporting this claim, a model trained using a standard centralized optimization method is harder to personalize than one trained using Federated Averaging. However, concentrating solely on personalization might lead to a biased personalized FL (pFL) result, where clients with lower performance suffer from the large client deviation. For this reason, in [179] pFL is addressed by dividing the layers into personalized and universal, where the personalized layers extract personalized attributes and the universal layers extract universal information. Unlike most previous work, in [33] the performance is improved simultaneously for pFL and standard FL, showing that strong personalized models emerge from the local training of generic FL algorithms due to implicit regularization. Furthermore, when clients have non-IID distributions, class-balanced objectives can further improve FL performance. In [218] a layer-wise federated learning algorithm is designed to account for the permutation invariance of the neurons and permits global model size adaptation. Ma et al. [153] propose a layer-wise aggregation policy to enable pFL among heterogeneous clients. Gao et al. [67] introduce lightweight modifications in the training phase to decouple the global model from the clients’ local models using the local drift, improving the robustness and speed of model convergence. Liang et al. [135] present a novel federated semi-supervised method to address the uneven reliability of non-IID local clients. Instead of aggregating local clients directly, they propose updating the global model by aggregating multiple sub-consensus models. Inside a sub-consensus model, a novel distance-reweighted model aggregation (DMA) module dynamically adjusts the weight of each sampled local client in the sub-consensus model.
In [272], the authors present a hierarchical Bayesian modeling and variational inference algorithm that offers a closed-form estimation of a confidence value. This confidence value takes into account the uncertainty of the clients’ parameters and the local model deviations from the global model. During the aggregation stage, the confidence value is used to weigh the clients’ parameters and adjust the regularization effect of the global model. Similarly, [178] estimates the uncertainty according to the performance of each client and performs aggregation by selecting highly reliable clients.
4) Representation Alignment
A widely used strategy to align the learned representations across the different clients is to introduce additional losses and constraints that act on the internal feature representations, trying to regularize them across multiple clients. In [261], an adversary module is proposed to reduce the divergence in feature representation among different clients, and two consensus losses are proposed to reduce the inconsistency in optimization objectives from two perspectives. In the pursuit of structure-feature alignment across the collaborative models, [254] designs a feature-oriented model structure adaptation method to ensure explicit feature allocation in different neural network structures. Then, they propose a feature-paired averaging scheme to guarantee aligned feature distributions and avoid feature fusion conflicts. Zhou et al. [271] demonstrate how data heterogeneity leads to a vicious cycle between classifier divergence and feature inconsistency across client models. To break it, they leverage feature anchors to align features and classifiers across clients. All client models are updated in a uniform feature space with corresponding classifiers.
5) Optimizing Communication Cost
Another set of works focuses instead on the communication aspects of FL, aiming to reduce network communication resource usage while enforcing privacy by avoiding the transmission of sensitive data. A new FL framework exploiting knowledge distillation that preserves privacy by design, while consuming substantially less network communication resources than baseline methods, is proposed in [71]. An extensive empirical study is performed in [170], comparing 12 variations of federated optimization methods on three commonly-used FL benchmark datasets, with the objective of understanding how initialization impacts the behavior of federated optimization methods. They found that a pre-trained solution can close the gap between training on IID and non-IID data, and that initializing FL with a pre-trained model can increase final model accuracy and reduce the number of rounds required to achieve a target accuracy. Tan et al. [203] present a novel FL method that improves communication efficiency in the heterogeneous setting by transmitting prototype representations instead of model weights, further improving privacy; they also propose a novel prototype-based aggregation. Isik et al. [94] propose a FL framework where, instead of model weights, clients train a stochastic binary mask used to sparsify a dense network with random weights, yielding a communication-efficient solution, faster convergence, and higher accuracy. Similarly, in [137] sparsity-aware training on clients is used to reduce both communication and computational costs. Zhang and Hanzo [260] propose an FL-aided multi-UAV system for classification, optimizing the communication cost.
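As a generic illustration of the communication saving, a top-k sparsification of the update transmits only the largest-magnitude entries (values plus indices); this sketch shows the general idea and is not the specific masking scheme of [94] or the training procedure of [137].

```python
import torch

def topk_compress(update, keep_frac=0.01):
    """Flatten the update, keep the top-k entries by magnitude, and
    return (indices, values); the receiver scatters the values back
    into a zero tensor of the original size."""
    flat = torch.cat([p.reshape(-1) for p in update])
    k = max(1, int(keep_frac * flat.numel()))
    idx = flat.abs().topk(k).indices
    return idx, flat[idx]
```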
An image classification dataset designed for FL is presented in [95], which is close to a real-world scenario where each client has a unique dataset under domain shift. The authors propose two algorithms dealing with two topologies (cycle and star), which turn out to be more communication-efficient. Instead of sending model weights to the server for aggregation, [235] proposes generating synthetic data on each client to mimic the loss landscape of the original data through distribution matching. This approach decreases the number of communication rounds and enhances model quality by transmitting smaller and more informative data.
6) Faster Convergence
Another key aspect is the time required for the convergence of the model. Hsu et al. [88] propose a framework to speed up convergence for imbalanced clients. The idea is to conceptually split large clients into multiple smaller ones, and repeat these small clients multiple times such that all virtual clients are of similar sizes. In addition, to avoid under-utilizing training examples from large clients, the probability that any client is selected for a round is proportional to its data size. Tang et al. [206] formulate the goal of accelerating the convergence of FL as optimization problems that maximize the posterior expectation of loss decrease. They utilize Gaussian Processes to solve the optimization problem and obtain an effective client selection strategy for heterogeneous FL. Sun et al. [201] design a synchronization scheduler that reduces the wasted waiting time of the server and improves time efficiency in realistic settings. Moreover, they mitigate the accuracy drop by applying a semi-asynchronous protocol and enable extremely lagging devices to contribute to the global model training.
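The size-proportional client selection described above can be sketched with a short helper (an illustrative implementation, assuming NumPy).

```python
import numpy as np

def select_clients(client_sizes, num_selected, rng=None):
    """Sample clients without replacement, with probability proportional
    to their local data size, as in the strategy described for [88]."""
    rng = rng or np.random.default_rng()
    p = np.asarray(client_sizes, dtype=float)
    return rng.choice(len(p), size=num_selected, replace=False, p=p / p.sum())
```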
7) Ensuring Privacy and Security
Local adaptation techniques (fine-tuning, freeze-base, multi-task learning, knowledge distillation) that an individual participant can use to mitigate the damage from privacy and robustness mechanisms are investigated in [256]. A hybridization of meta-heuristic methods with FL is proposed in [177], and analyzed in terms of the accuracy of the general model as well as of security against poisoning attacks. Shi et al. [195] observe that the loss landscapes are sharper in the presence of differential privacy. Therefore, they propose to use Sharpness Aware Minimization in FL, as in [24], in combination with differential privacy to generate local flat models with better stability and weight perturbation robustness. Other works focusing on attacks and security are presented in [122], [125], [126], [134], [202], [216], and [268].
8) Catastrophic Forgetting
The phenomenon of catastrophic forgetting (CF), typical of continual learning, also manifests in federated learning, due to the presence of diverse data and incomplete participation, ultimately undermining its performance [148]. In particular, [118] formalizes the concept of “local client forgetting”, showing that when a client performs local updates, it risks overly optimizing its local objective and forgetting knowledge about the other subsets of data. They address this problem at the client side, by re-weighting the softmax logits before computing the loss.
Another approach, GradMA [148], addresses CF at both the client and server side, by correcting the update directions of the server and clients simultaneously. It resorts to quadratic programming and memorized updates to correct the update directions, and also proposes a memory reduction strategy for practical FL scenarios with a large number of clients.
9) Non-Standard Settings
Other works focus on non-standard settings where data is unlabeled or made available in an incremental way. Huang et al. [90] make use of unlabeled public data and adopt self-supervised learning to enable communication among heterogeneous models and learn a generalizable representation. To alleviate catastrophic forgetting in FL, inter- and intra-domain knowledge distillation is used.
Li et al. [123] focus on federated semi-supervised learning, assuming that only a subset of clients is fully supervised. They propose a pseudo-labeling strategy to handle the catastrophic forgetting problem and a class-balanced adaptive threshold selection to refine the pseudo-labels.
Among the various approaches considering unlabeled data, the federated active learning framework has emerged as a promising solution. In [4], a simple active learning-based FL framework is presented to utilize unlabeled samples at the clients for training local models in two real-world applications: natural disaster and waste classification. Meanwhile, [108] introduces a novel query sample strategy that utilizes both the local-only and the global models to ensure inter-class diversity on both sides.
In [50] the authors address the problem of federated class-incremental learning, i.e., training a global class-incremental model in the FL setting. They distinguish catastrophic forgetting in this scenario into local and global forgetting. A class-aware gradient compensation loss and a class-semantic relation distillation loss are used to address local forgetting, while class-semantic relation distillation on the local clients compensates for global forgetting. Moreover, a prototype gradient-based mechanism is implemented to protect communication.
B. Object Detection
Object detection consists in localizing - typically with a bounding box - and labeling every object of interest in an image. Luo et al. [147] introduce a real-world dataset collected from street cameras that reflects the characteristics of the federated setting, namely that it is non-IID and imbalanced. They propose two configurations: in the first, each camera is a client (Street-20); in the second, cameras are clustered by nearby areas (Street-5). Finally, they test their benchmark using a modified version of FedAvg. Liu et al. [143] developed a platform (FedVision) that enables end-to-end collaborative training using personalized and locally stored datasets from different clients. They implement FedYOLOv3, which fits the YOLOv3 [181] object detection model into the FedAvg framework. The local parameters are sent to the server in a compressed and encrypted manner. A similar strategy is implemented in [96], considering an autonomous driving FL scenario where the framework is tested on four clients using the KITTI dataset [68]. In FedOD [200], multi-teacher distillation and a weighted bounding box fusion scheme are exploited to provide each client with both global and personalized models. In the FedOD setup, two pairs of clients take data respectively from the SODA10M [75] and NuScenes [23] datasets, while the server data comes from BDD100K [253], the largest of the three, in order to mimic statistical heterogeneity. Chow et al. [40] present a systematic framework for protecting FL from attacks. The authors propose a spatio-temporal signature analysis to mitigate network failures resulting from errors inherent in the spatial clustering of gradients.
C. Semantic Segmentation
Semantic Segmentation involves assigning a label to every single pixel in an image. State-of-the-art models utilize encoder-decoder architectures based on CNNs or transformers [37], [130], [144], [233]. These models typically have a huge number of parameters, but lightweight architectures with smaller model sizes and lower computational complexity [77], [85], [186] have also been proposed. As for the other tasks, distributed training solutions must be considered due to privacy and efficiency constraints. FedMargin [161] provides a distributed framework for both image classification and semantic segmentation. It estimates client deviations from the margins of class-conditional representations, and uses this information to drive the federated optimization by means of an attention mechanism. FedDrive [62] explores the application of federated learning to the problem of autonomous driving using data from multiple vehicles.
Federated Multi-target Domain Adaptation (FMTDA) [245] addresses the challenge of dealing with a limited number of clients having unlabeled target local datasets with dissimilar distributions. The approach leverages a labeled source dataset that is accessible on the server side, while simultaneously handling the aforementioned challenge in a federated setting. The authors label their setting multi-target to highlight how the distributions of the clients’ target datasets are affected by statistical heterogeneity. Experimental evaluation is performed on four clients, where each client has all the images belonging to one of the four cities of CrossCity [39] or to one of the four domains of BDD100K [253]. Similarly, Federated source-Free Domain Adaptation (FFREEDA), introduced in [193], considers a more realistic scenario with a large number of unsupervised clients, while the server has access to a labeled source dataset used only for pre-training. The task of federated incremental learning for semantic segmentation is instead considered in [51]. Differently from previous works, which focused on domain heterogeneity, [159] addresses the challenging class heterogeneity problem for semantic segmentation. They propose a modified cross-entropy loss and a pixel contrastive loss to mitigate inconsistencies and client drift during local adaptation.
D. Face Recognition
Face recognition (FR) is a biometric task that compares and analyzes patterns to uniquely identify the face of a person. A worldwide discussion on AI ethics has been sparked by commercial applications based on face recognition and facial analysis techniques, leading to the release of governance guidelines and recommendations from various countries (e.g., EU’s General Data Protection Regulation [214]). Therefore, in most cases, personal devices are not authorized to access large amounts of facial data, and data sharing is not permitted. Even providing only internal network representations, as in other vision problems addressing FL, might result in privacy leakage, since high-fidelity face pictures can be generated using ad-hoc ML techniques like DeepInversion [250]. The datasets themselves are subject to privacy violations: as an example, MS-Celeb-1M [73], a large-scale dataset containing 100K individuals, was taken off the Internet due to privacy concerns. Examples of publicly available datasets for this task are Labeled Faces in the Wild (LFW) [89], IARPA Janus Benchmark-A (IJB-A) [109], IJB-B [229], IJB-C [154], and Racial Faces in the Wild (RFW) [224]. As in other vision tasks, most of the approaches rely on pre-trained networks such as CosFace [217] and MobileFaceNet [38], which constitute efficient backbones for FR. Summarizing, face recognition requires model training and distribution techniques able to ensure privacy protection in light of the rising social agreement on data privacy, and FL is a very valuable tool to achieve this goal.
The first work to introduce federated learning into a FR task was FedFace [14]. The framework consists of a server, 3 trainers, and 2 validators. The server maintains a global momentum, which is evenly applied to each step of a training round. In addition, the aggregated models are validated on several parties, each of which has a private validation dataset, in order to dynamically discover the nearly optimal weightings for aggregation. FedFace [3] addresses the scenario in which each of the participating clients has photos belonging to a single identity. The server only communicates the parameters of the feature extractor and the class embedding, rather than the classification matrix, which contains sensitive information. FedGC [171] aims to guarantee that each client holds private class embeddings. They propose a softmax-based regularizer to correct the gradients of the embeddings by injecting a cross-client gradient term. Without data sharing, FedFR [276] seeks to learn a model for the unlabeled target domain by adapting from a source domain. The source dataset is used to pre-train one client, while on the other two clients a hierarchical clustering algorithm is used to generate pseudo labels for the target data. FedFR [141] employs a globally shared dataset to regularize the local model training. Only “hard” global samples, i.e., those with cosine similarity greater than a threshold to any of the local training data, are supplied in order to improve local personalization without increasing the computational load; in practice, a larger threshold corresponds to less global data. To identify a trade-off between personalizing and maintaining the local models’ similarity to the global model, contrastive regularization and feature customization are exploited simultaneously. In PrivacyFace [157], privacy-agnostic class prototypes are generated to prevent any specific individual in the cluster from being learned. Moreover, during the local optimization, a consensus-aware loss forces clients not to embed samples in inappropriate feature spaces (i.e., private clusters). Ding et al. [49] use transfer learning to speed up federated training on devices. They present an architecture in which a private projector helps to secure shared gradients without involving additional memory consumption or computational costs. Shang et al. [191] test the use of various loss functions. In particular, using pair-based methods such as the Multi-Similarity loss can be more accurate and communication-efficient when there are few classes for each client. However, when there are many classes on each client, using classification-based loss functions such as the CosFace [217] loss can improve the global model faster and with less communication cost.
E. Medical Imaging
Privacy issues are even more critical in the medical imaging field, where the data could disclose very sensitive information about the patients’ pathologies and medical history. In addition, although institutions and hospitals may gather the same kind of medical data, those data may have varying characteristics due to collection methods, standards and protocols, institutional policies and priorities, different patient populations, and different privacy regulations. The bottleneck of FL in dealing with multi-source decentralized medical imaging is the cross-client variance issue, which is exacerbated by the restricted amount of training data typically accessible to each client and by the constrained number of clients.
1) Disease Classification
A FL framework based on FedAvg for detecting lung nodules in CT scans is proposed in [13]; in this work the task is modeled as a classification one. A FL method for medical imaging classification that uses multiple domains of siloed data, i.e., data from different sites with different features and labels, is proposed in [8]. It exploits generative models for data synthesis, which allows the training of a deep learning model on the synthesized data from all domains. Roth et al. [185] investigate using FedAvg for a breast density classification task on real-world multi-institutional data. Tan et al. [205] use transfer learning to improve performance over the centralized approach. The method is evaluated on a dataset of mammograms for breast cancer diagnosis. Jimenez et al. [102] test their framework on breast cancer datasets as well. They exploit an ensemble of deep learning models and use a memory mechanism that allows the models to learn from the past experience of other models. Additionally, Li et al. investigate multi-site fMRI classification on the Autism Brain Imaging Data Exchange (ABIDE) dataset using federated learning [131].
2) Segmentation for Disease Detection
Yan et al. [241] aim to reduce client-to-client differences by converting each client’s raw image data into a common image space using image-to-image translation techniques, while still respecting FL’s privacy settings. The algorithm is validated on the PROSTATEx benchmark [11] for prostate cancer detection. Fedcross [238] targets four public datasets (MSD [9], NCI-ISBI [140], PROMISE12 [20], and PROSTATEx [11]) gathered in various clinics with different MRI scanners. They train a global model sequentially across clients rather than performing aggregation. FedDG [142] learns a federated model from multiple remote source domains in order to directly generalize to unknown target domains. The data distribution information is shared across clients via an efficient continuous frequency space interpolation approach. Comprehensive studies on two common medical image segmentation tasks (retinal fundus image segmentation and prostate MRI segmentation) are performed. The narrowing of the generalization gap is also addressed in FedSM [236], which exploits personalized models that match different data distributions; a model selector then determines the closest model/data distribution for every test sample. The model has been evaluated on retinal disc/cup and prostate segmentation. Segmentation is addressed in FedDM [274] using bounding boxes as a form of weak supervision. Together with a collaborative annotation strategy, a hierarchical aggregation scheme is employed to mitigate local drift. The method is evaluated on polyp and prostate segmentation, using both magnetic resonance and endoscopic images. FedCE [99] provides an approach to compute client contributions from both the gradient and the data space, and then recommends a fair global aggregation based on those estimates.
MRI brain tumor segmentation is a challenging task due to several factors, including differences in tumor size, shape, location, and appearance, as well as variations in imaging protocols and equipment across different institutions. These variations can result in differences in image quality, noise, and artifacts, making it difficult to accurately segment brain tumors using traditional machine-learning approaches. By training models on data from multiple sources, federated learning can help to account for variability in imaging protocols and equipment, as well as capture a broader range of tumor characteristics. Additionally, federated learning can help to reduce the risk of data breaches and protect sensitive patient information. Sheller et al. [192] were the first to test FedAvg on a multi-institutional dataset, the brain tumor segmentation dataset (BraTS) [158]. Li et al. use a similar method and add a privacy-preserving feature extraction technique to improve the model’s robustness [129]. A distillation-based strategy aimed at reducing communication resource requirements is used in [71].
3) MRI Reconstruction
Guo et al. propose a multi-center MRI reconstruction method, FL-MRCM, using a federated learning approach that maintains the privacy of the local MRI data [72]. They aid the adaptation of the local models through adversarial alignment, by sending global latent representations to a domain discriminator. In FedMRI [64] the MRI reconstruction model is divided into two parts, a shared encoder and a site-specific decoder, to mitigate domain shift. In FedGIMP [57], each site generates a synthetic image to enforce MRI reconstruction: the local generators follow a federated scheme to produce synthetic priors, which are combined with global priors as in adversarial models. FedPR [63] addresses the communication bottlenecks and catastrophic forgetting by learning prompts with minimal trainable parameters and only updating local prompts in the approximate null space of the global prompts. In this way, it efficiently tackles catastrophic forgetting, ensuring that knowledge beyond the local distribution is not overwritten. Wang et al. [221] deal with the problem of misaligned unpaired neuroimaging data: they use significantly distorted images to train a federated framework for brain image synthesis without breaking the hospitals’ privacy regulations.
4) COVID-19 Detection
Privacy protection is also required when contributing to breakthrough medical discoveries that necessitate data sharing across borders. The identification and segmentation of COVID-19 lesions in medical images have been targeted using deep learning, with the aim of providing a useful tool for doctors. Yan et al. [240] investigate the use of FL for COVID-19 detection from chest X-ray images: experiments with different backbones demonstrate the effectiveness of federated learning for training on distributed data without compromising privacy. Dou et al. [53] present a study showing the effectiveness of FL for detecting COVID-19 lung abnormalities in CT scans. Their framework is trained with private data from 3 medical institutions in Hong Kong and validated on 4 external datasets consisting of CT images from different countries. Kumar et al. [113] propose a blockchain-federated learning approach for privacy-preserving COVID-19 detection from CT scans: the global model is maintained as a distributed ledger with expanding lists of data records whose privacy is ensured by cryptographic hashes. Yang et al. [242] propose a federated semi-supervised learning method for segmenting COVID-19 regions in chest CT images using data from China, Italy, and Japan, combining labeled and unlabeled data from the participating sites. Finally, Zhang et al. [265] suggest a dynamic-fusion-based federated learning approach for COVID-19 detection that exploits both X-ray and CT scan images, dynamically fusing the model updates coming from the different clients.
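The dynamic fusion step can be pictured with a sketch like the following, where clients whose local validation accuracy falls below a threshold are excluded from the round and the remaining updates are weighted by accuracy; the selection and weighting rules here are our illustrative assumptions, not the published algorithm.

```python
def dynamic_fusion_aggregate(client_states, val_accuracies, threshold=0.5):
    """Fuse only the updates of clients performing well enough locally,
    weighting each selected update by its validation accuracy.
    Assumes floating-point parameter tensors throughout."""
    selected = [(s, a) for s, a in zip(client_states, val_accuracies)
                if a >= threshold]
    if not selected:  # fall back to plain averaging if all underperform
        selected = list(zip(client_states, val_accuracies))
    total = max(sum(a for _, a in selected), 1e-8)
    return {name: sum((a / total) * s[name] for s, a in selected)
            for name in selected[0][0]}
```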
Benchmarks
Comparing the performance of different federated learning schemes is a challenging task: FL is a relatively new area of research and its evaluation still lacks standardization. In particular, although several efforts have been made to introduce standard benchmarks, the community has yet to agree on a common setting for FL.
The lack of a consistent data distribution across clients and of a common choice of parameters (e.g., number of clients, local epochs, rounds, etc.) often makes the reported results not directly comparable. For this reason, Tables 1, 2, 3, 4 and 5 also report the settings considered by the approaches tackling the different tasks.
Without a standardized setting, it is hard to determine whether a particular algorithm or model is really better than another, or if the difference in performance is simply due to differences in the experimental setup. Nevertheless, in this section, we introduce the main efforts in outlining benchmarks and then we present some results, trying to organize them into groups of comparable outcomes.
LEAF [26] offers a benchmark for federated learning that focuses on edge devices with limited computational resources. It includes benchmark models and several datasets, among which a federated split of EMNIST [42], the so-called FEMNIST. FedML [78] proposes a benchmark suite for federated learning that includes both vision and non-vision datasets; it comprises baseline models (e.g., FedAvg [155] and FedProx [128]) and federated splits of CIFAR-10 [112] and CIFAR-100. FedScale [114] presents a benchmark for federated learning that focuses on scaling up to large numbers of devices; it includes several vision datasets, among which OpenImage [111] and Google Landmark [228], as well as baseline models and evaluation metrics. These benchmarks [26], [78], [114] present experiments with various settings, including different numbers of devices, data distributions, and organizations of the communication rounds.
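A recurring ingredient of these experimental settings is the non-IID partition of a centralized dataset (e.g., CIFAR-10/100) across clients. For reference, the sketch below implements the widely used Dirichlet-based split; the function name and default values are illustrative.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=10, alpha=0.5, seed=0):
    """For each class, draw a distribution over clients from Dirichlet(alpha)
    and assign that class's samples accordingly; smaller alpha yields more
    heterogeneous (non-IID) client datasets."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Cut this class's sample list according to the drawn proportions.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```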
For the two tasks of image classification and semantic segmentation, we present a comparison of recent papers on the most common datasets in Tables 6 and 7, respectively; for the other tasks, the great variability of the considered datasets and settings makes the comparison less significant.
Starting from image classification, we focus on the CIFAR-10/100 datasets, which are the most common choice for works tackling this task. Following the previous discussion, we divide Table 6 into blocks, where each block corresponds to a set of experiments performed under the same conditions: results are comparable within the same block, but different blocks refer to different settings. As a general result, state-of-the-art techniques provide a relevant improvement over baseline approaches like FedAvg or FedProx.
Results for semantic segmentation are reported in Table 7. Here the number of comparable approaches is much more limited, and the two most common dataset choices are Cityscapes and CamVid. Again, ad-hoc approaches for the task achieve gains of 5–10% over the baseline FedAvg scheme.
New Trends
Traditional machine learning algorithms are commonly trained on specific datasets to address particular tasks. However, this approach encounters limitations when confronted with novel tasks or domains lacking sufficient labeled data. Meta-learning offers a solution to this challenge by learning from a diverse set of tasks, enabling efficient adaptation to new ones. In FL, it could also assist in selecting relevant tasks or clients in each round, enhancing overall efficiency, and it would ease the deployment of personalized models. Although considerable research has been conducted on this subject [30], [101], future studies should delve deeper into meta-learning techniques to facilitate the emergence of personalized federated learning [169], [244].
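To illustrate how a meta-learning step can plug into the federated loop, the sketch below performs a Reptile-style server update over client models that were locally adapted for a few steps; the interpolation rule and the `meta_lr` value are simplifying assumptions rather than a specific published method.

```python
import torch

def reptile_server_update(global_model, adapted_client_models, meta_lr=0.5):
    """Move the global weights a fraction of the way toward the average of
    the locally adapted client models, so that the resulting initialization
    is easy to personalize with a few local gradient steps."""
    global_state = global_model.state_dict()
    for name, param in global_state.items():
        if not param.is_floating_point():
            continue  # skip buffers like batch counters
        avg_adapted = torch.stack(
            [m.state_dict()[name] for m in adapted_client_models]).mean(dim=0)
        global_state[name] = param + meta_lr * (avg_adapted - param)
    global_model.load_state_dict(global_state)
```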
In real-world computer vision settings, especially when the clients are users' mobile devices, it is often unrealistic to assume the availability of labeled data on the client side. Thus, federated unsupervised domain adaptation approaches are another intriguing research direction [193], [245].
Another salient aspect to consider is that computer vision has primarily focused on single-modality data, i.e., RGB images, and single-objective tasks. However, real-world FL applications often involve multiple data types, including images, textual descriptions, 3D or depth data, and data coming from other sensors. Learning a joint representation from multiple modalities typically leads to a deeper and more meaningful understanding of the provided information [16], [184]. Nevertheless, multimodal data introduces additional challenges, such as modality incongruities (e.g., different noise and distribution shifts) and missing modalities [65], which are even more critical in FL settings where they can appear in a client-dependent way. Recent works extend FL concepts to tackle these problems in vision-language tasks [36], [65], [234], [255]. To this end, we expect more research on multimodal FL, allowing the aggregation of knowledge from different data modalities in order to train more comprehensive and robust models.
Additionally, there is a rising demand for more extensive research on the use of foundation models in FL. These models can be employed on the server side to substantially enhance the performance and convergence speed on computer vision tasks. Pre-training such models from scratch is a costly operation, both in terms of computational resources and training data, and the reuse of foundation models can help overcome this issue. Therefore, researchers have started to explore novel approaches, taking inspiration from traditional transfer learning methods, which aim to facilitate the seamless integration of large-scale pre-trained models, such as BERT [208] or CLIP [146], into existing systems.
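A common integration pattern, sketched below under our own simplifying assumptions, keeps the pre-trained backbone frozen on every client and federates only a lightweight head, which also drastically reduces the communication cost; any pre-trained feature extractor (e.g., a CLIP-like image encoder) could play the role of the placeholder backbone.

```python
import torch
import torch.nn as nn

class FrozenBackboneClassifier(nn.Module):
    """A frozen foundation-model backbone with a small trainable head:
    only the head is trained locally and exchanged with the server."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # the foundation model is never updated
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)
        return self.head(feats)

def aggregate_heads(head_states, weights):
    """FedAvg restricted to the tiny head: a few KB per round."""
    return {name: sum(w * s[name] for w, s in zip(weights, head_states))
            for name in head_states[0]}
```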
Conclusion
Federated learning research has grown significantly over the past few years, and computer vision has been one of its driving fields. The large size of training data and the privacy issues in application fields like face recognition or medical imaging make federated learning a valuable tool for vision applications. Consequently, many approaches have been developed, and this survey highlights the most significant ones.
With respect to standard deep learning approaches for computer vision, FL research has not evolved as a straightforward comparison of performance on benchmark datasets. Instead, most works address various practical problems, often also designing the setting they consider. On one hand, this diversity of research directions has stimulated the development of a variety of new techniques and ideas. On the other hand, it makes it difficult to compare approaches and, consequently, to understand which building blocks are best suited for real applications. To this aim, in this paper, we presented an overview of the most relevant tasks, problems and methods, in order to gain a better view of the whole picture. Despite these challenges, there are ongoing efforts to standardize federated learning settings. As the field continues to grow, a more standardized approach to federated learning is likely to emerge, enabling researchers to better compare results and build upon previous work.
Another limitation regards the target application: most techniques focus on one or a few vision tasks, although the algorithms could technically address others. Following this observation, we first presented the employed methodologies in a task-agnostic way in Section V and then introduced the various approaches organized according to the considered task in Section VI. We believe that our categorization of the main methods can be beneficial for new research, as different methods and strategies can be combined for better performance.
ACKNOWLEDGMENT
(Donald Shenaj and Giulia Rizzoli contributed equally to this work.)