Journals & Magazines >IEEE Access >Volume: 11

Enabling All In-Edge Deep Learning: A Literature Review

An overview of the structure of the survey paper.

Abstract:

In recent years, deep learning (DL) models have demonstrated remarkable achievements on non-trivial tasks such as speech recognition, image processing, and natural langua...Show More

Metadata

Abstract:

In recent years, deep learning (DL) models have demonstrated remarkable achievements on non-trivial tasks such as speech recognition, image processing, and natural language understanding. One of the significant contributors to the success of DL is the proliferation of end devices that act as a catalyst to provide data for data-hungry DL models. However, computing DL training and inference still remains the biggest challenge. Moreover, most of the time central cloud servers are used for such computation, thus opening up other significant challenges, such as high latency, increased communication costs, and privacy concerns. To mitigate these drawbacks, considerable efforts have been made to push the processing of DL models to edge servers (a mesh of computing devices near end devices). Recently, the confluence point of DL and edge has given rise to edge intelligence (EI), defined by the International Electrotechnical Commission (IEC) as the concept where the data is acquired, stored, and processed utilizing edge computing with DL and advanced networking capabilities. Broadly, EI has six levels of categories based on the three locations where the training and inference of DL take place, e.g., cloud server, edge server, and end devices. This survey paper focuses primarily on the fifth level of EI, called all in-edge level, where DL training and inference (deployment) are performed solely by edge servers. All in-edge is suitable when the end devices have low computing resources, e.g., Internet-of-Things, and other requirements such as latency and communication cost are important such as in mission-critical applications (e.g., health care). Besides, 5G/6G networks are envisioned to use all in-edge. Firstly, this paper presents all in-edge computing architectures, including centralized, decentralized, and distributed. Secondly, this paper presents enabling technologies, such as model parallelism, data parallelism, and split learning, which facilitates DL training and deplo...

An overview of the structure of the survey paper.

Published in: IEEE Access ( Volume: 11)

Page(s): 3431 - 3460

Date of Publication: 05 January 2023

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2023.3234761

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

The global community is increasingly becoming a data-driven environment in which end devices are generating vast quantities of data outside of the traditional data centers. International Telecommunication Union anticipates that global internet traffic per month will reach 607 Exabytes (EB) in 2025 and 5016 EB in 2030 [1]. This enormous amount of data has a positive impact on artificial intelligence (AI) applications. In particular, deep learning (DL) relies on the availability of large quantities of data for its development, including training and inference [2], [3].

DL has shown promising progress in natural language processing, computer vision, and big data analysis in recent years. For example, DL models, such as BERT, Megatron-LM, GPT-3, and Gropher, are reaching a human-level understanding of the textual data in natural language processing tasks [4]. Moreover, DL models have exceeded human performance on various tasks, including object classification tasks [5], [6] and real-time strategy games [7].

DL training and deployment in the majority of scenarios use a centralized cloud-based structure. However, the need to collect, process, and transfer vast data to the central cloud often becomes a bottleneck in many mission-critical use cases [8], [9]. In this regard, edge computing provides a high-performance bridge from local systems to private and public clouds. The edge of the network, which often has modest hardware and memory resources (depending on the network infrastructure provider), can offer vital infrastructure to facilitate DL at the edge. Traditionally to avoid the bottleneck in many mission-critical use cases, edge computing performs tasks such as collection, filtering, and lightweight computation of raw data before transferring data to the cloud [10]. However, with the proliferation of edge servers and progress in DL-based architectures and algorithms, there is a possibility to perform DL model training and deployment efficiently at the network’s edge.

The convergence of DL and edge computing has given rise to a new paradigm of intelligence called edge intelligence (EI) [11], [12]. EI aims to facilitate DL deployment closer to the data-generating source. As EI exploits the full potential of resources available at end devices, edge servers, and cloud servers for DL training and inference, based on resource utilization, it is categorized into six levels [13]. These six levels are defined based on where the DL-model training is taking place and where it is getting deployed in the network hierarchy. For simplicity, we assume a network hierarchy formed of cloud servers, edge servers, and end devices. DL training and deployment at cloud servers face significant challenges, including issues such as high latency, data privacy, network congestion, and security threats such as Denial-of-Service attacks [14]. On the other hand, despite being available in huge quantities, end devices suffer from constrained computation power, which is particularly relevant in the context of training and deployment of large DL models. In this regard, edge servers are a viable alternative. Moreover, due to their closer proximity to end devices, edge servers enable reducing network congestion in comparison to the centralized cloud architecture. Furthermore, this proximity minimizes latency, providing for quicker inference when compared with DL models deployed at the cloud server. Even though edge servers have less computing power than the cloud, they do have significantly more computational power than end devices. Thus, edge servers can train and deploy DL models that require larger computing resources than that available at the end devices.

The exclusive use of edge servers for both DL training and deployment is called all in-edge. Innovations and research on the emerging area of all in-edge DL processing are in their infancy. Unlike prior surveys [13], [15], [16], [17], [18], [19] summarized in Table 1, to the best of our knowledge, none of the existing surveys presents a detailed view from the all in-edge level perspectives on its enablers, key metrics of performances and challenges when DL is processed at all in-edge level. Specifically, this survey answers the following: leftmargin=0.5cm

Which architecture (centralized, decentralized, and distributed) should be used if the configuration of edge servers is known at the all in-edge level?
What are the state-of-the-art enabling technologies that facilitate DL training and inference from the all in-edge level?
What are the critical performance metrics required in addition to the standard metrics (e.g., accuracy and precision) to evaluate the performance of the DL model’s applications at the all in-edge level?

TABLE 1 Summary of Related Surveys. ✗: Not Included; •: Not Considered From All In-Edge Paradigm; ✓: Included

This paper is organized in the following way. It first introduces the computing paradigm and the all in-edge level of EI in Section II. Then, in Section III, it discusses the architecture, enabling technologies for training and inference of DL models at the all in-edge paradigm. Besides, this paper examines the model adaption techniques for effectively deploying DL models at the edge. Next, it reviews the key performance metrics used for evaluating all in-edge DL processing in Section IV. Section V discusses the open challenges and future direction of research for DL at all in-edge. Finally, Section VI presents a summary and identifies the primary conclusions and findings of the paper. Overall, Figure 1 depicts the organization of this paper in the block diagram, and Table 2 provides the list of important acronyms.

TABLE 2 List of Important Abbreviations

FIGURE 1.

Overview of the structure of the survey paper.

Show All

SECTION II.

Preliminary

The centralized nature of the cloud data center has several drawbacks. One of the most considerable disadvantages is the distance between the data centers and end (user) devices, as it requires more wait time to process the data. On the other hand, edge computing offers an indisputable advantage by physically moving storage and processing resources closer to the source of data generation, thereby achieving lower latency. This Section presents the distinction between the cloud and edge computing paradigms. Besides, it presents the all in-edge level of the EI paradigm, which comprises only edge servers.

A. Introduction to the Cloud and Edge Computing Paradigm

The computation of DL can be done by various devices, including cloud servers, edge servers (ESs) and edge devices(EDs). This determines the following computing paradigms.

1) Cloud Computing

Cloud computing is a paradigm for wide-reaching distributed computing that uses technologies such as grid computing, service orientation, and virtualization. It enables on-demand infrastructure access to a shared pool of configurable computing resources that can be acquired and released with minimum intervention from the server infrastructure provider. Cloud servers have significant storage capacity and computational power to facilitate the overwhelming data coming via the backhaul network from end-user [20], [21]. Thus, cloud servers can satisfy resource requirements for aggregation, pre-processing, and inference for any artificial intelligence-based applications. The cloud servers are inter-connected, providing global coverage with a backhaul network. The cloud computing paradigm involves the end devices that offload data directly to the cloud for further processing. The end devices mentioned here are the originators of the data. In the cloud, data can persist for days, months, and years, meaning long-term temporal data can be collated and processed. For example, cloud data centers facilitate forecasting models based on a large amount of historical time series data [22]. Cloud computing is still the appropriate vehicle for modeling and analytical processing if latency requirements and bandwidth consumption are not an issue, provided measures for preserving privacy and security are in place [23].

2) Edge Computing

With the surge in the proliferation of IoT devices, traditional centralized cloud computing struggles to provide an acceptable Quality of Service (QoS) level to the end customers [24]. To meet the QoS of IoT applications, there is a need for cloud computing services closer to data sources (e.g., IoT devices, EDs, etc.). As defined by International Electrotechnical Commission (IEC), the extension of computing services from cloud computing to the network edge is called edge computing (EC) [25], [26]. Edge computing helps application developers cater to user-centric services closer to clients. In contrast to cloud computing, latency incurred from edge computing is significantly less, as a majority of data does not have to travel via a backhaul network to the cloud [27]. Less consumption of backhaul networks also means the requirement of bandwidth consumption is considerably less, as shown in Figure 2.

FIGURE 2.

Layered network architecture with cloud, edge and end devices (the left part of the figure), and the ratings of edge intelligence (EI) into six levels (the right part of the figure).

Show All

B. All In-Edge Level of Edge Intelligence

Significant progress has been made in the DL domain in the last decade. Technical advancements in high-performance processors [28] coupled with improvements in DL algorithms, and the availability and maturity of big data processing [29] have contributed to the increase in DL performance. However, DL processing (training and inference) still occurs mainly in the cloud, as DL models require significant computational resources. As mentioned earlier, this can adversely impact the DL’s QoS due to high latency. At the same time, there has been substantial research focused on facilitating DL processing at the edge. While edge computing provides relatively modest computing resources and storage capacity, the training and deployment of DL applications on such devices would greatly help in achieving acceptable QoS for real-time DL applications. For example, real-time applications that would benefit from the merger between edge computing and DL include automated driving [30], and real-time surveillance [31], all of which intrinsically require fast processing and rapid response time [32], [33]. The concept of edge intelligence is a new paradigm that utilizes end devices, edge nodes, and cloud data centers to optimize the processing of DL models (for both training and inference) [13].

As depicted in Figure 2, edge intelligence is divided into six distinct levels based on computational resources offered by the cloud, edge, and end devices for the DL training and inference phase. The fifth level of edge intelligence, depicted in Figure 2, corresponds to all in-edge processing. As defined in [13], all in-edge (fifth level) refers to the edge intelligence paradigm where both training and inference of the Deep Neural Network (DNN) take place in the ES (also known as in-edge manner). This level is critical to satisfying the latency requirements of real-time artificial intelligent applications. In addition, it is helpful in scenarios with intermittent or limited connectivity to the backhaul network [34]. This level helps in reducing the amount of data that needs to be transferred from end devices to the cloud whenever the DL model is being trained. Also, inference provided by the all in-edge level is faster than any other level of EI where inference takes place in the cloud data center [35].

The modest computational resources available with ESs when processing DL models at an all in-edge level facilitate training and inference of relatively large models [36]. Based on the DL model’s size, either a single ES can train a DL model or a group of ESs collaborate to train a DL model. Technologies for training DL at level five are described in detail in Section III-B. Similarly, inference from all in-edge can be produced from either a single ES or multiple ESs working collaboratively.

SECTION III.

Deep Learning at All In-Edge

This Section reviews the current state of the art for training and adapting DL models from the all in-edge level perspective. Furthermore, the Section details the different architectures employed for DL training within the all in-edge level.

A. Architecture

The architecture used for DL training at the ES can broadly be divided into three main categories: centralized, distributed, and decentralized, as shown in Figure 3. The architecture is defined based on the role of two different types of ES. The first is the processing ES, which is tasked with training the DL model, and the second is the decision-making ES, which coordinates how the model is shared across the network.

FIGURE 3.

Architecture for training Deep Learning model in-edge: (a) Centralized, (b) Decentralized, and (c) Distributed Architecture.

Show All

1) Centralized Architecture

In a centralized architecture (Figure 3(a)), the processing ES sends the data produced by the end devices (without training local DNN) to the decision-making ES. Decision-making ES then undertakes the DNN training task acting as processing ES [37], [38]. The centralized ES is assumed to have sufficient computing power (and typically, the computing power of the decision-making ES exceeds that of each of the processing ES). In this architecture, the decision-making ES is responsible for acting as both the processing and decision-making ES. Due to decision-making ES acting as processing ES at the same time makes it vulnerable to a single point of failure.

2) Decentralized Architecture

In a decentralized architecture, depicted in Figure 3(b), each processing ES is responsible for training its own local DNN. Once a local model is trained, the ESs send their local DNN model copy to a corresponding decision-making ES. This decision-making ES aggregates the DNN models and subsequently shares it with other decision-making ES to whom it is connected [40], [41]. Compared to centralized architecture, decentralized architecture addresses the single point of failure by dispersing models amongst multiple decision-making ESs. Thus, even if a single decision-making ES was to go offline, the system could continue operating.

3) Distributed Architecture

A distributed architecture aims to provide a much more resilient architecture by making each ES (typically decision-making ES) capable of processing (training a local copy of DNN) and making decisions on how to share the data across the networks with other peers. In this architecture, each ES establishes a random peer-to-peer connection with another ES in the network for that specific iteration to share their local models. The receiving ES aggregates received model weights with their local copy of parameters. The training of DNN is stopped once the loss stabilizes in most ESs, and further updating the model parameters does not change the model’s estimate for a given classification or regression problem [39].

B. Enabling Technologies

This Section focuses on the technologies that enable the model training process undertaken by the ES. Model parallelism, aggregation frequency control, gossip training, gradient compression, data parallelism, federated learning, and split learning at the ES are emerging technologies, as seen by the substantial amount of research interest and citations, shown in Table 3.

TABLE 3 Comparison of Enabling Techniques for All In-Edge

1) Model Parallelism/DNN Splitting

Model Parallelism (also referred to as model splitting or DNN Splitting) is a technique in which the DNN is split across multiple ESs to overcome the constrained computing resources. Model parallelism utilizes a decentralized architecture such that after DNN partitioning, a number of processing ESs train different layers of the DNN model, and a decision-making ES coordinate the training and ensure the correct flow of activations. The model partitioning ensures that the workload assigned to an individual processing ES does not exceed its computational capabilities. Model splitting can be categorized into either horizontally partitioned or vertically partitioned, as shown in Figure 4. In the vertical partitioning approach, one or more layers of the DNN are housed in different servers based on the computational requirement of the layer and the available resources of the processing ES. Whereas in horizontal partitioning, neurons from different layers are placed together based on the computational power of the processing ES. Horizontal partitioning is beneficial when input data is significantly big (number of attributes in a dataset) and single processing ES fails to perform a single-layer operation.

FIGURE 4.

Model Parallelism.

Show All

In [40], the authors proposed a framework for scheduled model parallel machine learning called STRADS for vertical partitioned parallel machine learning. The DL application scheduler introduced in the STRADS framework helped control the update of the model parameters based on the model’s dependency structure and parameters of the DNN model. The authors also successfully demonstrated $10\times $ faster convergence of the model parallelism-based topic modeling implementation over the model without parallelism. In 2021, research [41] on training the Megatron language model, authors utilized horizontally partitioned model parallelism to train a multi-billion parameter language model. In contrast to the single-GPU-per-model training, the authors in this research implemented model parallelism on the same PyTorch transformer implementations with few modifications. To train such a big system, 512 GPUs were consumed to train the transformer-based model. The same model was then able to achieve the SOTA accuracy on the ReAding Comprehension Dataset From Examinations (RACE [42]) dataset with improved throughput by 10% as compared to existing approaches. Model parallelism provides a way to combine the resources from multiple processing ES to enable all in-edge training of a single DL model.

2) Aggregated Frequency Control (AFC)

AFC adopts a decentralized architecture for training DL models, in which a finite number of discrete clusters of ESs are formed, as shown in Figure 5. The task of each of the discrete clusters is to train an identical DNN model. Each cluster has one ES that acts as a decision-making ES. The task of the decision-making ES is to provide all processing ESs in the cluster with an identical copy of the DNN model. Once each processing ES receives its copy, they train that model using their local data and send back the updated DNN model weights to the decision-making ES for aggregation. The decision-making ES aggregates the weights from each of the individual processing ESs in the cluster. Once aggregation is done, the decision-making ES sends back the updated DNN model to all the processing ESs in the cluster. In addition, after each aggregation at the decision-making ES, a “significance function” is computed. This function will determine if the current aggregation has led to a significant improvement. If the improvement is deemed significant, then the current cluster’s decision-making ES will inform the decision-making ES of each of the other clusters of the new model weights. Hence, each decision-making ES will have the best available model copy at any given point in time.

FIGURE 5.

Aggregated Frequency Control (AFC).

Show All

The significance function in AFC influences the frequency with which updated weights are sent from one decision-making ES to another. This, in turn, can reduce the communication overhead in the network. The Approximate Synchronous Parallel (ASP) model [43] is one such model that targets the problem of geo-distributed DL training. This research successfully employed an intelligent communication system based on the AFC technique achieving minimization in WAN communication by the factor 1.8-$53.5\times $ between the two data centers.

3) Gossip Training

Gossip Training provides a way to reduce the training time in a distributed architecture. Gossip training is based on the randomized selection of the ES to share the gradient weights for aggregation [44]. Each ES acts as a decision-making ES and processing ES to make the whole training system fault resilient. In this technique, ES will randomly select another node and subsequently send the gradient weight updates to the selected ES. Each ES will then compute the average received weights. Gossip training works in a synchronized and distributed manner. In [45], researchers demonstrated that GoSGD (Gossip Stochastic Gradient Descent) takes 43% less time to converge to the same train loss score when compared to the EASGD (Elastic Averaging SGD [46]) algorithm used in distributed architecture training. In other research, PeerSGD [47] modified the GoSGD algorithm [45] to work in the distributed trustless environment. The algorithm was modified at the stage when the random peer was selected to share the update. The peer who receives the update can decide whether to accept the received weights based on the loss difference (hyper-parameter defined in the research). PeerSGD was evaluated with various clients ranging from 1 to 100. In the experiment, PeerSGD demonstrated $2\times $ faster convergence when tested with 10 clients compared to 100 clients, but it still had comparable accuracy. The limitation of PeerSGD is its inability to achieve convergence in a scenario when data classes are segregated across multiple clients. A modified version of GoSGD was also applied to Wide Area Networks [48], and heterogeneous edge computing platforms [49] and demonstrated results comparable to the original GoSGD algorithm. Gossip training facilitates all in-edge model training without any central authority, making the training process more resilient if any ES is not reachable during training.

4) Gradient Compression

Gradient Compression is another approach to reducing communication while training the DL model, which can be applied to either a distributed or decentralized architecture to facilitate all in-edge training. Gradient compression minimizes the communication overhead incurred by addressing the issue of redundant gradients. Authors in the research [50] found that 99.9% of the gradient exchange in distributed stochastic gradient descent is redundant. They proposed a technique called Deep Gradient Compression, which reduced the communication necessary for training ResNet-50 from 97 MB to 0.35 MB. In gradient compression, two approaches are used in practice: gradient quantization and gradient sparsification.

In gradient quantization [51], gradient weights are degraded from having a higher order of precision values to a lower precision order i.e., representing weights using float 12 rather than float 64. In [52], the author proposed high-dimensional stochastic gradient quantization for reducing the communication in the federated learning setting (federated learning setting is explained in III-B6-6). In the proposed architecture, the authors utilized a uniform quantizer and low-dimensional Grassmannian to decompose the model parameters, followed by compression of the high-dimensional matrix of stochastic gradients into its norm and normalized block gradients. Normalized block gradients are then scaled with a hinge vector to yield the quantized normalized stochastic gradient (QNSD). This QNSD was then transmitted by the processing ES, who trained the model to the decision-making ES, who then aggregates the various gradients and updates a global DL model. Through the framework of hierarchical gradient quantization, authors reduced the communication overhead theoretically and, at the same time, achieved a similar accuracy to the SOTA signSGD scheme [53].

Another approach to gradient compression is gradient sparsification. This technique allows the gradient exchange only if the absolute gradient values are higher than a certain threshold [54]. For example, the threshold in the research ranged from 2 to 15. So if the absolute values of the gradients elements exceed the threshold, they are allowed to be transmitted. The higher the value of the selected threshold, the lower the communication cost (as the threshold limits the transmission of gradient weights). This method reduced the required communication bandwidth by three orders of magnitude for data-parallel distributed SGD training of DNNs. Recent research [55] found that selecting an appropriate threshold is challenging due to the variation in the value of the gradients. This research proposed an alternative approach called the edge Stochastic Gradient Descent (eSGD) method. In eSGD, determining if the gradient update should be sent over the network is based on the loss function. The loss function is used to compute the loss against each coordinate of the gradient at time steps ‘$t-1$ ’ and ‘$t$ ’. If the loss value at time step ‘$t$ ’ is smaller than its value at time step ‘$t-1$ ’, the current gradient ‘$gt$ ’ will be transmitted to other ESs to build a global model. The standard SGD, when applied to MNIST with 128 batch size and trained for 200000 epochs, will achieve 99.7% accuracy. In contrast, the eSGD method with the same setting attained an accuracy of 95.31% and 91.22% with a drop ratio (% of gradients that will not be communicated by ES) of 25% and 50%, respectively. In [56], the authors aim to identify an optimal trade-off between the communication that takes place within the layers of a DNN (housed in different ESs) and the computations required for the gradient sparsification. The authors developed an optimal merged gradient sparsification algorithm that required 31% less time per iteration over the SOTA sparsified SGD. For the all in-edge paradigm, the size of the message being communicated by the servers utilizes a significant bandwidth. The gradient compression approach helps reduce the size of the message being communicated from one ES to another, thereby freeing up network bandwidth which can then be utilized by other edge applications.

5) Data Parallelism

Data parallelism (also referred to as data splitting) is a technique that follows a decentralized architecture at the all in-edge level. A sizeable primary dataset is split in data parallelism to form mutually exclusive smaller datasets. These datasets are then forwarded to the processing ESs. In this architecture (see Figure 6), the decision-making ES initially distributes the uninitialized model copy to each processing ESs. The processing ES starts training after it receives the dataset and the initial model copy. The decision-making ES is responsible for producing the global model by aggregating the local models residing inside the processing ESs. The global model is next sent back to the processing ESs so that it can continue to update its local model [57], [58], [59].

FIGURE 6.

Data Parallelism.

Show All

6) Federated Learning

Federated learning (FL) is a popular framework for training DL models using a decentralized and distributed architecture [60]. Although the native framework treats mobile devices as clients responsible for training the DL model, recent research shows clients can be extended to the ES [61], [62], which makes this technology applicable for all in-edge. In this section, the ‘client’ refers to processing ES with low computing resources, and the ‘aggregation ES’ refers to decision-making ES with modestly higher computing capacity than the client.

Federated learning enables ES to collaboratively learn a shared DL model while keeping all the training data on the client. As shown in Figure 7, during the first stage, all the clients download the global DL model from the aggregation ES, which is responsible for maintaining the global DL model. Once the global DL model is received, the client trains it using its own private data, making it a local DL model. Once training is completed on the client, the local model weights are sent to the aggregation ES. Once the aggregation ES receives all the weights from the participant client, it is then aggregated to formulate the new global DL model [63], [64], [65]. After aggregation, the global DL model is again circulated to the client for further training, making the whole approach cyclic. This framework ensures that the performance of the aggregated global model should be better than any of the individual client-side models [66] before being disseminated.

FIGURE 7.

Vanilla federated learning.

Show All

Federated Learning Systems (FLS) can be further categorized based on their data partitioning strategy, privacy mechanism, and communication architecture [66], [67], [68], [69]. The data partitioning strategy dictates how the data is partitioned across the clients. There are three broad categories of data partitioning (i) horizontal data partitioned FLS, (ii) vertical data partitioned FLS, and (iii) hybrid data partitioned FLS. In horizontal data partitioning, all the clients have the same attributes/features in their respective datasets needed to train the private DL model. Whereas in vertical data partitioned, all the clients have different attributes/features in the dataset. By utilizing entity alignment techniques (which helps find the overlap in other datasets where some of the features are common) [70], [71], overlapped samples are collected for training machine learning models. Hybrid data partitioning utilizes the best of both worlds. The entire dataset is divided into horizontal and vertical subsets in this category. So each subset can be seen as an independent dataset with fewer non-overlapping attributes and data points compared to the entire dataset [68].

FLS provides privacy to a certain degree by default by allowing raw data to stay only with the client ES. However, while exchanging the model parameters, there is the possibility that exchanged model parameters could still leak some sensitive information about private data [72]. Therefore, privacy mechanisms have been employed for FLS. These mechanisms can be subdivided into either cryptographic techniques or differential privacy techniques. Cryptographic techniques require that both the client and aggregation ES operate on encrypted messages. Two of the most widely used privacy-preserving algorithms are homomorphic encryption [73], [74], [75], [76] and multi-party computation [77], [78], [79]. On the other hand, differential privacy introduces random noise to either the data or the model parameters [80], [81], [82], [83]. Although random noise is added to the data or model parameters, the algorithm provides statistical privacy guarantees while ensuring that the data or model parameters can still be used to facilitate effective global model development.

The communication architecture of an FLS can be broadly subdivided into two subcategories: distributed and decentralized architectures. In a decentralized architecture, the aggregation server is responsible for collecting and aggregating the local models from each client. It then sends the updated global model for retraining to each client. In this architecture, communication between the processing ES and decision-making ES can happen in synchronous [68], [84] as well as in asynchronous [84], [85], [86], [87] manner. One of the significant risks in a decentralized architecture setting is that the decision-making ES may not treat each processing ES equally. That is, the decision-making ES may have a bias toward specific processing ES due to their higher participation during a training phase. A distributed architecture can mitigate the potential issues of bias. A distributed architecture in federated learning can be based on a P2P scheme (ex., gossiping scheme as described in Section III-B3), a blockchain-based system, or a graph-based system. In a distributed architecture, all the participating ESs are responsible for acting as processing and decision-making ES. Therefore, if a gossip scheme is implemented to achieve the decentralized FLS, all the models will randomly share the updates with their neighbors [88], [89]. In contrast, if a blockchain system is implemented, it leverages smart contracts (SC) to coordinate the DL training, model aggregation, and update tasks in FLS [90], [91], [92], [93], [94]. Lastly, if graph-based FLS is implemented, each client will utilize the graph neural network model with its neighbors to formulate the global models [95], [96], [97], [98].

FLS provides a much-needed way of enabling the DL model training and inference at the all in-edge paradigm. With an FLS, one can easily integrate multiple low-resource ESs to help train the DL model at the edge. Also, based on the resources available at the edge and the communication overhead of FLS, one gets the freedom to select either a distributed or decentralized architecture.

7) Split Learning

In federated learning, each processing ES is responsible for locally training the whole neural network. In contrast, split learning provides a way to offload some of this computation between processing and decision-making ESs. More differences between federated learning and split learning are summarized in Table 3. As we advance in this section, the ‘client’ refers to processing ES with low computing resources, and the ‘server’ refers to the decision-making ES with a modestly higher computing capacity than the client. Split learning divides a neural network into two or more sub-networks. Figure 8 illustrates the case where we split a seven-layer neural network into two sub-networks using layer 2 as the “cut layer”. After the split, the two sub-networks are shared between the client, who trains the initial two layers of the network, and the server, who trains the last five layers of the network. At the training time, the client initiates the forward propagation of its confidential data and sends the activation from the cut layer to the server-side sub-network. The server then continues the forward propagation and calculates the loss. During backpropagation, gradients are computed and propagated initially in the server sub-network and then relayed back to the client side sub-network. In Split learning, during the training and testing, the server never gets access to the parameters of the client-side network or the client’s data.

FIGURE 8.

Split Learning.

Show All

Split learning can be broadly categorized into three configurations based on how the input data and labels are shared across the clients and servers. Figure 9 shows three configurations- simple vanilla split learning, split learning without label sharing, and split learning for vertically partitioned data. A main neural network is partitioned into two sub-networks in simple vanilla split learning. The initial sub-network, along with the input data for the neural network, remains with the client, whereas the remaining sub-network, along with the labels, resides with the server [99]. Split learning without label sharing is identical to vanilla split learning, except that the labels reside with the client instead of the server. To compute the loss, the activations outputted from the server-side network are sent back to the client, who holds the last layer of neural network [100]. The loss is calculated, and gradients are computed from the last layer held by the client and then sent back to the server, and backpropagation takes place in the usual way. The final configuration of split learning is where the clients train their partial sub-network for vertically partitioned data and then propagate the activations to the server-side sub-network. The server-side sub-network then concatenates the activations and feeds them to the remaining sub-network. In this configuration, labels are also shared with the server [101].

FIGURE 9.

Different configurations of Split Learning- (a) simple vanilla split learning, (b) split learning without label sharing and (c) split learning for vertically partitioned data.

Show All

In a federated learning system, clients can interact with the server in parallel, which helps achieve faster training compared to a split learning approach. In split learning, the server must wait for all clients to send their activations before propagating the activation through the server-side network. Also, in contrast to federated learning, split learning reduces the computational requirements on the client-side (as only a partial amount of the network resides with the client). Recently, to leverage the advantages of both split learning and federated learning, a hybrid technique called splitfed learning was proposed [102].

In splitfed learning, a DL model is broken down into the sub-networks shared amongst the clients and servers. In addition, there is a separate federated aggregation server for the client and the servers. All the clients perform the forward pass in parallel and independently of each other (as not seen in split learning). The resulting activations are sent to the server-side sub-network, which performs a forward pass for the remaining sub-network portion. The server then calculates the loss and back propagates the gradients back to the very first layer on the client-side, as described earlier with split learning. Once this process finishes, the servers send their model weights to a federated aggregation server, which aggregates the independent server-side sub-network to form a global server-side model. Similarly, the clients send their sub-network weights to another aggregation server. At the end of aggregation, a global model can be developed by combining the aggregated client-side weights with the aggregated server-side weights as shown in Figure 10 (a) [102], [103].

FIGURE 10.

Variants of splitfed learning (a) Splitfed learning with the same number of client and server-side sub-networks and (b) Splitfed learning with only one copy of server-side sub-network.

Show All

Splitfed learning can have several variants. For example, the first one is where each client has its own corresponding server-side network in the main server, i.e., the number of client-side models is equal to the number of server-side models as explained in the earlier paragraph. In the second variant, there are multiple clients but only a single server. Therefore, each client-side model sends its activations to a single common server-side sub-network, thereby reducing the required aggregation step and the need to keep multiple copies of the server-side networks as compared to the first variant as shown in Figure 10 (b). Moreover, as the server keeps only one copy of the server-side sub-network, it makes the server-side do forward and backward pass sequentially with each of the client’s data (activations of the cut layer) [103], [121].

Key takeaways: The above-mentioned enabling technologies at the confluence of DL and all in-edge contribute to our understanding of training DL models using only ESs. The enabling technologies help address issues such as limited computational resources, communication overhead and latency between ESs, data privacy, and model robustness. Model parallelism and split learning provide a means of decreasing the computational resource required by individual ES. By splitting the DL model, multiple resource-constrained ESs can train a few layers of the network (rather than the entire model). Aggregated frequency control and federated learning enable parallel model training, facilitating faster model convergence. Gossip training, federated learning, and aggregated frequency control adopt a distributed architecture, thereby robustly training a DL model in situations where the reliability of an ES is not predictable. Also, we have discussed splitfed, in which federated learning is combined with split learning. Splitfed overcomes the drawback of federated learning of training a large ML model in resource-constrained ESs [122]. At the same time, it eliminates the weakness of split learning to deal with one client at a time while training [121].

C. All In-Edge Model Adaption

Model Adaption techniques provide a means by which DL model deployment at the ES can be achieved despite the lack of computing resources, storage, and bandwidth. Model adaption techniques can be broadly categorized into model compression and conditional computation techniques, as summarized in Table 4.

TABLE 4 Comparison of Model Adaption Techniques for All In-Edge

1) Model Compression

Model compression techniques facilitate the deployment of resource-hungry DL models into resource-constrained ES by reducing the number of parameters or training DL models that have been reduced in size from the original model. Model compression exploits the sparse nature of DL models by compressing the model parameters. Model compression reduces the computing, storage, memory, and energy requirements needed for all in-edge deployment of DL models. This Section reviews pruning, quantization, knowledge distillation, and low-rank factorization.

a: Pruning

Pruning of parameters is the most widely adopted approach to model compression. This approach evaluates DL model parameters against their contribution to predicting the label. Those neurons that make a low contribution in inference are pruned from the trained model. Parameter pruning can significantly reduce the size of a DL model, but it also has the potential to impact the model’s performance adversely. In [12], the authors were able to reduce the size of the AlexNet and VGG-16 by a factor of $9\times $ and $13\times $ respectively, without incurring any loss in the accuracy over the ImageNet dataset. In another work [123], the authors utilized pruning to create a compressed speech recognition model on field-programmable-gate-array (FPGA). This technique compressed the LSTM model by $10\times $ with negligible loss in accuracy. SS-Auto [124] is a single-shot structured pruning framework. In contrast to earlier versions of pruning where the entire DL model’s parameters were selected for pruning, in structured pruning, independent pruning on columns and rows of filters and channels matrix (for CNN-based DL models) is performed. The compressed DL models produced by the SS-Auto framework did not suffer any degradation in performance, achieving the original performance levels when tested on CIFAR-10 and CIFAR-100 datasets. However, the compressed VGG-16 model reduced the number of convolutional layers parameters by a factor of 41.4% for CIFAR-10 and 17.5% for the CIFAR-100 dataset. In [125], the authors proposed a new framework based on weight pruning and compiler optimization for faster inference while preserving the privacy of the training dataset. This approach initially trains the DL models as usual on the user’s data. The model then undergoes privacy-preserving-oriented DNN pruning. Finally, synthetically generated data (with no relevance to the training data) is passed through a layer of the user-trained model. The decision to prune a parameter or not from the current layer is based on how similar (by computing the Frobenius norm) the original output of the layer (without pruning) is when compared with the output of the layer after the parameter has been pruned. If the outputs are close enough, then that parameter is pruned. This pruning technique is named the alternating direction method of multipliers (ADMM). Experimental results of the framework outperformed the state-of-the-art end-to-end frameworks, i.e., TensorFlow-Lite, TVM, and MNN, with speedup in inference up to $4.2\times $ , $2.5\times $ , and $2.0\times $ , respectively.

b: Quantization

Data quantization degrades the precision of the parameters and gradients of the DL model. More specifically, in quantization, data is represented in a more compact format (lower precision form). For example, instead of adopting a 32-bit floating-point format, a quantization approach might utilize a more compact format such as 16-bit to represent layer inputs, weights, or both [13]. Quantization reduces the memory footprint of a DL model and its energy requirements. In contrast, pruning the neurons in a DL model will reduce the network’s memory footprint but does not necessarily reduce energy requirements. For example, if later-stage neurons are pruned in a convolutional network, this will not have a high impact on energy because the initial convolutional layer dominates energy requirement [13]. In [126], the authors utilized a dynamic programming-based algorithm in collaboration with parameter quantization. With the proposed dynamic programming-assisted quantization approach, the authors demonstrated a $16\times $ compression in a ResNet-18 model with less than a 3% accuracy drop. The authors in [127] proposed a quantization scheme for the inference phase of the DL model that targets weights along with the inputs to the model and the partial sums occurring inside the hardware accelerator. Experiments showed that the proposed schema reduced the inference latency and energy consumption by up to $3.89\times $ and $4.84\times $ , respectively, while experiencing a 1.18% loss in the DL models inference accuracy.

c: Knowledge Distillation

Knowledge distillation is a model compression technique that helps train a smaller DL model from a significantly larger trained DL model. The knowledge distillation comprises three key components: (i) The original knowledge, (ii) the distillation algorithm, and (iii) the teacher-student architecture [128]. The original knowledge is the original large DL model, which is referred to as the teacher model. The knowledge distillation algorithm is used to transfer knowledge from the teacher model to the smaller student model using techniques such as Adversarial KD [129], [130], Multi-Teacher KD [131], [132], [133], Cross-modal KD [134], [135], Attention-based KD [136], [137], [138], [139], Lifelong KD [140], [141] and Quantized KD [142], [143]. Finally, the teacher-student architecture is used to train the student model. A general teacher-student framework for Knowledge distillation is shown in Figure 11. In this architecture, the teacher DL model is trained on the given dataset in the initial phase. Once the teacher DL model is trained, it assists the shallower student DL model. The student DL model also uses the same dataset used to train the teacher DL model, but labels for the data points are generated by the teacher DL model [144]. The knowledge distillation technique helps a smaller DL model imitate the larger DL model’s behavior.

FIGURE 11.

Teacher-student architecture for Knowledge Distillation.

Show All

KD provides a viable mechanism of model compression [128]. This technique helps reduce the number of ESs required to deploy the larger DL model at the all in-edge level. Reduction in the number of ES also helps achieve faster inference time from ESs (as less communication needs to be done within ESs).

d: Low-Rank Factorization

Low-rank factorization is a technique that helps in condensing the dense parameter weights of a DL model [145], [146], limiting the number of computations done in convolutional layers [147], [148], [149] or both [150], [151]. This technique is based on the concept of creating another low-rank matrix that can approximate the dense metrics of the parameter of a DL model, convolutional kernels, or both. Low-rank factorization can save memory on an ES while decreasing computational latency because of the resulting compact size of the DL model. In [152], the authors used the low-rank factorization by applying a singular value decomposition (SVD) method. They demonstrated a substantive reduction in the number of parameters in convolutional kernels, which helped reduce floating-point operations(FLOPs) by 65.62% in VGG-16 while also increasing accuracy by 0.25% when applied to the CIFAR-10 dataset. Unlike pruning, which necessitates retraining the DL model, after applying low-rank factorization, there is no need to retrain the DL model. Further research [153] proposed a sparse low-rank approach to obtain the low-rank approximation. The sparse low-rank approach is based on the idea that all the neurons in a layer have different contributions to the performance of the DL model. So based on the neuron ranking (based on the contribution made for inference), entries in the decomposition matrix were made. This approach, when applied over the CIFAR-10 dataset with VGG-16 architecture, achieved $3.6\times $ times smaller compression ratio to the SVD. Other commonly used methods for low-rank factorization are tucker decomposition (TD) [154], [155], [156] and canonical polyadic decomposition (CPD) [157], [158].

2) Conditional Computation

Conditional computational approaches alleviate the tension between the resource-hungry DL model and the resource-constrained ES. In conditional computation, the computational load of the DL model deployed over a single ES is distributed with other ES in the network. The selection of an appropriate conditional computation technique is based on the DL model’s latency, memory, and energy requirements. Therefore, depending upon the configuration of the ES and DL model’s computation requirements, DL model deployment can utilize one or any combination of the techniques (Early Exit, Model Selection, and Result Cache) defined in this section.

a: Early Exit

The main idea behind the early exit approach is to find the best tradeoff between the deep DNN structure of a DL model and the latency requirements for inference. In this approach, a deep neural network trained on a specific task is partitioned across multiple ESs. The partitioning of the DL model is based on a layer-wise split, such that a single or multiple layers can reside across multiple ESs based on the computation power provided by each ES. Each ES that hosts one or more layers of the DL model also attaches a shallower model (or side branch classifier) to the output of the final layer on the current ES. The model is then trained as shown in Figure 12. The purpose of the side branch classifier is to provide an early prediction or early exit. During inference, the data is propagated through the network (and each ES host). Each host will calculate both the output of the hosted layers and the output of the local early exit network. If the output of the early exit layer exceeds a defined confidence threshold, then the propagation stops (this is the early exit), and the ‘early’ result is returned. If the prediction from the early exit network is less than the confidence threshold, the output of the larger DL model’s layers is then propagated to the next ES in the chain, which holds the next layer of the larger DL model and another early exit network. The process of propagating the layer’s output to the subsequent layer is carried out until one ES inferences the class with a higher confidence score. This process can provide $`n - 1'$ exit points for a DL model with $`n'$ neural network layers; thus, if layer 1 of the larger DNN along with the side branch can infer the class with the required confidence that output will be given as a response to the end user eliminating any further propagation of activation values along the ES.

FIGURE 12.

Early exit adaption of Deep Neural Network.

Show All

Researchers in [159] provided the programming framework ‘Branchynet’, which helps incorporate the early exit approach into a standard DL model. The framework modifies the proposed DL model by adding exit branches at certain layers. With the multiple early exit points, it can also be considered as an enabler for localized inference using DL models with less number of layers. For the AlexNet DL model, ‘Branchynet’ framework was able to reduce the inference time by a factor of $2\times $ and $6\times $ on CPU and GPU, respectively. In [160], the authors proposed DeepQTMT to lower the encoding time spent on video compression. In the DeepQTMT, the authors utilized a multi-stage early exit mechanism to reduce the high encoding time. Experimental results showed the encoding time was reduced by a factor ranging from 44.65% to 66.88% with a negligible adverse impact in bit-rate within the range of 1.32% to 3.18%. Therefore, while the early exit strategy can decrease latency and facilitate a faster response time, it does have the drawback of increasing the memory footprint of the DL model, thus utilizing more storage at each ES.

b: Model Selection

The model selection approach selects a specific DL model for inference from a set of available DL models based on the latency, precision, and energy requirements of the end user [16]. In a model selection strategy, multiple DL models with varying DL model structures are trained. The different trained models each have a specific inference latency, energy requirements, and accuracy. Once trained, each of the models is deployed to various servers. The model selection approach will then select the DL model based on the end user requirements [13].

The model selection approach is similar to the early exit approach, with only one difference. In model selection, independent DL models are trained; in contrast, in the early exit, only one DL model is trained over which multiple exit points are created. Authors in [161] proposed a new concept of BL-DL (big/little DL) based on the model selection approach. The authors proposed the score margin function, which helps in deciding whether or not the inference made by a small DL model is valid. The score function is computed by subtracting the highest probability from the second-highest probability of a class from the last classifier layer of a DL model. Thus, a score function can be seen ranging from 0 to 1. The higher the value of the score function, the higher the estimation that inference is accurate. The lower the value of the score function, the lower the estimation of inference being accurate. If the score function estimation is low, then a larger DL model is invoked to make the inference on the same input data. The same research showed a 94.1% reduction in the energy consumption on the MNIST dataset, with accuracy dropping by 0.12%. Recently in [162], an adaptive model selection technique has been used to optimize the DL model’s inference. The proposed framework builds a standard DL model, which learns to predict the best DL model to use for inference based on the input feature data. To facilitate the training of the selection model (which is the standard KNN model in this scenario), different pre-trained models like Inception [163], ResNet [164], MobileNet [165] were evaluated on the same image dataset. For each image, the DL model that achieved the highest accuracy is set as the output. The training data for the KNN model comprises the features extracted from the image as input and the optimal DL model as output. Once the model selector (the KNN) is trained, it is then used to determine the DL model, giving the best accuracy on the selected image. In the end, the selected DL model makes an inference on the image as shown in Figure 13.

FIGURE 13.

Model Selection of Deep Neural Network.

Show All

Experimental results validated the reduction in the inference time by a factor of $1.8\times $ for the classification task and $1.34\times $ time reduction in a machine translation task. While model selection facilitates a decrease in inference time, it does incur an increased memory footprint across the ESs due to the number of pre-trained DL models.

c: Result Cache

Result cache techniques help in decreasing the time required to obtain the prediction from the ES. In this approach, frequent input queries (such as frames in the case of video classification or images in the case of image classification) and the associated predictions made by the DL model are saved in an archive on the ES. So, before any query is inferred from the DL model, intermittent lookup happens. In intermittent lookup, if a query is similar to a saved query, the result is inferred from the archive (cache). Otherwise, the query goes to the DL model for inference. This technique becomes more powerful in environments where the queries can be expected to exhibit similarity. In [166], the authors proposed a cache-based system that leveraged the ES for image classification. When evaluated on image classification applications, the approach yielded up to $3\times $ speedup on inference for image recognition tasks without any drop in the model’s performance (accuracy). Another system for video analysis utilized the cached convolution outputs of the CNN layers to reduce the computation for making an inference [167]. The idea is again based on the similarity of consecutive frames in videos. Initially, in this approach, activations from each layer of DL for a query frame are saved in the cache. For the next subsequent frame (query), the query is pushed through the first layer, and the resulting activations are compared with the previous activation values of the same layer saved in the cache. Only those activations that differ significantly from the cached version are calculated and propagated further through the network. If the activation is deemed similar, they are carried over with their cache results to the next layer. In the experiment, the authors showed a significant speedup of $3\times $ to $4\times $ compared to the vanilla CNN model with no change in accuracy. In other research [168], the authors proposed a framework similar to result caching. In this research, queries were initially passed through the DL model, and activations of each layer were cached (archived) in the ES along with the prediction from the DL model. During the inference, after passing the image through the layers of DL, activations are checked with the saved activations of a specific layer. If the activations of a particular layer for the current query match with the activations in the cache, further propagation of the activations is stopped, and the cached result is returned as the prediction. This research was applied to a VGG-16 architecture using CIFAR and yielded a $1.96\times $ latency gain using a CPU and a $1.54\times $ increase when using a GPU with no loss in accuracy. Result caching provides a significant boost in the scenario where the query (frames processing for the boundary identification) for inference does not change significantly. While result caching improves the overall latency of the neural network, it also incurs a larger memory footprint.

Key takeaways: This Section described model adaption techniques, which facilitate the efficient deployment of large DL models at the all in-edge level, divided into segments- model compression and conditional computation, summarized in Table 4.

Model compression techniques such as pruning, quantization, knowledge distillation, and low-rank factorization provide practical ways of reducing the size and memory footprint of the DL model. Reduction in model size due to model compression techniques also decreases the amount of computation needed for making an inference. However, there lies a need for DL model retraining while adopting pruning and knowledge distillation, whereas no retraining is required for quantization and low-rank factorization. Also, a drop in accuracy is observed amongst all model compression techniques leaving low-rank factorization.

Conditional computation techniques such as early exit, model selection, and result caching provide practical ways to utilize the computational resources of the available ESs to provide faster inference. In contrast to a reduced memory footprint observed while using the model compression technique, the memory footprint increases while adopting conditional computation. Also, no significant drop in accuracy is observed while utilizing the conditional computation techniques.

SECTION IV.

Key Performance Metrics of All In-Edge

The application of DL at the edge has gathered significant momentum over the last few years. Typically research evaluates the performance of a limited number of DL models often adopting a different set of standard performance metrics (such as top-k accuracy [193] and mean average precision [194]). Unfortunately, these standard metrics fail to provide insights into the runtime performance of DL model inference at ESs. Relevant performance metrics for DL services include but are not limited to latency, use-case-specific metrics, training loss, communication cost, privacy-preserving metrics, energy consumption, memory footprint, combined metrics, robustness, transferability, and lifelong learning. Table 5, summarizes the metrics/description against KPI utilized at all in-edge.

TABLE 5 Key Performance Metrics at All In-Edge

This Section will discuss the different metrics that should be evaluated when developing all in-edge based DL models.

A. Use-Case Specific Metrics

Use-case specific metrics are used to determine the quality of the trained DL model and are dependent on the problem statement. For example, if the use-case is a classification problem, then accuracy, F1-score, roc_auc, etc. can be evaluated [195], [196], [197]. Accuracy and F1-score are the most common metrics used to determine the quality of classification problems. In the classification problem, the DL model is trained to correctly predict the class of interest, i.e., true positive (TP) and the class of dis-interest, i.e., true negative (TN). Equation 1 represents the mathematical formulation of Accuracy, where FP is a false positive (classes that are wrongly classified as positive), and FN is a false negative (classes that are wrongly classified as negative).\begin{equation*} { \text {Accuracy }}=\frac {T P+T N}{T P+T N+F P+F N}. \tag{1}\end{equation*} View Source Equation 2 denotes the calculation of F1-score, commonly used when there is class label imbalance, and both classes hold the same importance in classification metric. In Equation 2, precision is the measure of the proportion of positive identifications that are actually correct, and recall is the measure of the proportion of actual positives that are correctly identified.\begin{equation*} \text {F1-score} = 2 \times \frac {\text {precision} \times \text {recall}}{\text {precision} + \text {recall}}. \tag{2}\end{equation*} View Source To assess the DL model for regression problems, metrics like max variance, R-square, root mean squared error (RMSE), etc. are evaluated [198], [199], [200], [201], [202]. RMSE defined in equation 3, is a widely used metric for regression-based problems. In equation 3, $y_{i}$ and $\hat {y_{i}}$ are the actual and predicted labels, and N is the total number of samples.\begin{equation*} \text {RMSE} = \sqrt {\frac {1}{N} \sum _{i =1}^{N} (y_{i} - \hat {y_{i}})^{2}}. \tag{3}\end{equation*} View Source While these metrics are widely used, they are essential for the performance comparison of different models’ architecture and strategies deployed on the same dataset over the ES.

B. Training Loss

The process of training a DL model requires the optimization (typically minimization) of a specific loss function. The training loss is a metric that captures how well a DL model fits the training data by quantifying the loss between the predicted output and ground truth labels. Different metrics are selected based on the type of problem, i.e., classification or regression. Some of the widely used loss functions to capture the learning of the DL model at the edge while training are mean absolute error [203], [204], [205], mean square error [206], [207], negative log-likelihood [208], [209], cross-entropy [210], [211], [212], Kullback-Leibler divergence [213], [214], [215] etc.

Cross-entropy, also called logarithmic loss, log loss, or logistic loss, is a widely accepted loss function for classification problems. In cross-entropy, the predicted class probability is compared to the actual class label. A loss is calculated that penalizes the DL model higher if the probability is very far from the actual value. The penalty itself is logarithmic, which yields a significant score for large differences close to 1 and small score for small differences tending to 0. The cross-entropy loss function is defined as \begin{equation*} L(y, \hat {y})=-\sum _{i=1}^{n} y_{i} \log \left ({p_{i}}\right), { \text {for }} \mathrm {n} { \text {classes, }}\tag{4}\end{equation*} View Source where $p_{i}$ is the predicted class probability of the $i^{t h}$ class. Similarly, for regression problems, mean squared error (MSE) is the most commonly used loss function. The loss is the mean of the squared differences between true and predicted values across the dataset. MSE is defined as:\begin{equation*} L(y, \hat {y})=\frac {1}{N} \sum _{i=0}^{N}\left ({y-\hat {y}_{i}}\right)^{2}.\tag{5}\end{equation*} View Source

C. Convergence Rate

When training a DL model, we typically monitor its loss until it reaches some measure of convergence. We would expect the loss to decrease until any further updates to DL model parameters will not change the test dataset inference made by the DL model, known as the convergence of the DL model. The convergence rate is normally computed when using a distributed and decentralized architecture to train a DL model at the edge. One of the primary goals of the distributed/decentralized DL model training at the edge is to speed up the convergence of DL models getting trained at multiple locations. Thus, DL models at different ESs need to collectively converge to a consensus that any further updates in the model will not change the estimate of the model for a given classification or regression problem [216]. Convergence rate, as a metric, defines the number of iterations one algorithm will take to converge to an optimum solution [217]. Thus, in a decentralized/distributed architecture at all in-edge, the convergence rate as a metric becomes crucial because the different combinations of the architecture selected along with synchronization schemes (synchronous, asynchronous, etc.) have different convergence rates [218], [219], [220], [221].

D. Latency

When inferring from a model at the edge, both the computational latency and communication latency become critical key performance metrics. Computational latency provides an estimate of the time that the DL model will require to process a query input and infer on the same [222], [223], [224]. Whereas communication latency provides an estimate of the time from when a query is sent from the origin server until the result is returned [175], [225], [226], [227]. For mission-critical cases [228], DL models with low computational and communication latency are more favored. This metric becomes critical because one of the reasons to move from cloud to all in-edge is to reduce the latency incurred during the DL inference phase. The measuring unit of latency can range from milliseconds to seconds based on the latency requirement from DL-based applications.

E. Communication Cost

When a DL model is deployed for inference on an ES, many requests by the end user(s) are raised to get inference from the DL model. The volume of data, e.g., kilobytes (KB) or megabytes (MB), transmitted from the end user(s) has the potential to create congestion at the ES. The communication cost metric evaluates the amount of data (message size of each query) flowing to the ES from the end user [56], [229]. It also takes into consideration the inference data, which is reverted to the end user. Active monitoring of the communication cost is important to prevent potential congestion points [230], [231], [232]. In typical cases, measuring the unit of communication cost is kept in KB or MB based on how much data is required to make an inference.

F. Privacy Preserving

Privacy-preserving metrics provide a means to quantify the level of user privacy offered by a DL model using privacy-preserving technologies [233]. We can assess the ability of a model to retain data privacy during the training and inference phases. In both phases, there are two types of data leakage: direct and indirect. Direct leakage at the training phase occurs when an external party gains access to non-encrypted training data sent to a centralized ES. In addition, direct leakage can also occur when in a decentralized/distributed setting when an external party gains access to activations or gradients that are sent from one edge server to another during training. Similarly, direct leakage at the inference phase occurs when an external party gains access to non-encrypted client data sent to an ES hosting the DL model. Indirect leakage at the training phase occurs when an external party gains access to DL model parameters which can indirectly provide information regarding training data. Indirect leakage from the inference phase comprises results provided by the DL model, which can leak sensitive information regarding the data DL model, is trained upon, i.e., membership inference attack and model inversion attack. Well-established encryption algorithms manage direct leakage from non-encrypted data at the training and inference phase, like DES, 3DES, AES, RSA, and blowfish, which don’t require evaluation [234].

During the training phase, we can use the mutual information score (MIS) to measure the level of direct leakage (activation or gradients being sent from one edge server to another) or indirect leakage (access to DL model parameters) [121].

The mutual information score (represented as $I$ , in equation 6) measures how much information a random variable $X$ (e.g., smashed data/model parameters) can reveal about another random variable $Y$ (e.g., non-encrypted training data/another set of model parameters). For $X$ and $Y$ with joint distribution of $p(x,y)$ , it is defined as follows:\begin{equation*} I(X, Y) = \sum _{x\in X, y\in Y} p(x,y) log\frac {p(x,y)}{p(x)p(y)}. \tag{6}\end{equation*} View Source This metric ranges from 0 to 1, where 0 implies the raw data are independent of the intermediary activation vector or model parameters differ from another set of model parameters.

During the inference phase, two attacks (the model inversion attack and the memberships inference attack) can lead to indirect leakage. A model inversion attack allows an adversary to recover the confidential dataset utilized for training a supervised neural network. In an image-based model to evaluate model inversion, the structural similarity index measure (SSIM) is used to evaluate the reconstruction accuracy [235]. The magnitude of the deformation field resulting from non-linear registration of the original and reconstructed images is used to evaluate the reconstruction accuracy. The structural similarity index measure between two images $x$ and $y$ of common size $N\times N$ is:\begin{equation*} \mathrm {SSIM}(x, y)=\frac {\left ({2 \mu _{x} \mu _{y}+c_{1}}\right)\left ({2 \sigma _{x y}+c_{2}}\right)}{\left ({\mu _{x}^{2}+\mu _{y}^{2}+c_{1}}\right)\left ({\sigma _{x}^{2}+\sigma _{y}^{2}+c_{2}}\right)},\tag{7}\end{equation*} View Source where

$\mu _{x}$ the average of $x$ ,
$\mu _{y}$ the average of $y$ ,
$\sigma _{x}^{2}$ the variance of $x$ ,
$\sigma _{y}^{2}$ the variance of $y$ ,
$\sigma _{x y}$ the covariance of $x$ and $y$ ,
$c_{1}=\left ({k_{1} L}\right)^{2}, c_{2}=\left ({k_{2} L}\right)^{2}$ two variables to stabilize the division with weak denominator,
$L$ the dynamic range of the pixel-values (typically this is $2^{\# {\,\,\text {bits per pixel }}}-1$ ), and
$k_{1}=0.01$ and $k_{2}=0.03$ by default.

A membership inference attack, in contrast, does not recover the training data but allows an adversary to query a deployed DL model to infer whether or not a particular example was contained in the model’s training dataset. An adversary in this approach trains another DL model to infer whether a specific example was present in the training dataset. Accuracy as a metric is utilized to evaluate the quality of the adversary’s DL model. One of the current metrics proposed to measure membership inference is the ratio of the true-positive rate to false-positive rates. This metric provides more strict measures to make the DL model provide a guarantee that in an ideal scenario, none of the positive cases should be incorrectly identified. This metric becomes different from AUC- ROC curve as TPR is only reported for fixed low FPR (e.g., 0.001% or 0.1%) [236].

G. Energy Consumption

There is a wide range of available DL models, and their individual energy requirements for computation can vary significantly. For some resource-constrained environments, it becomes infeasible to host models with a larger energy footprint [237]. The energy requirements of different models should be evaluated for their training and inference phase at the all in-edge level [238], [239], [240]. Power consumption (watts and kilowatts units) as measurement can be utilized to determine energy consumption [241]. Clearly, this metric is particularly relevant to an ES when hosting all the parts of a deep learning model.

H. Memory Footprint/Model Size

For an ES with limited computational resources, it can be challenging to host a DL model with a huge number of parameters. The larger the DL model the more parameters it will have, and consequently the more memory space (in RAM) required to host the model. Model size or memory footprint is computed having ‘MB’ as their unit of measurement [242], [243], [244], [245], [246]. For a specific image classification problem, if MobileNet V2 with 3.54 million parameters is selected, it will have 14 MB as model size whereas if InceptionV4 with 42.74 million parameters is selected for the same problem, it will have a 163 MB model size requirement [241].

I. Combined Metrics

As all in-edge needs to satisfy multiple constraints (i.e., energy, quality of DL model, latency etc.), it becomes more important to introduce hybrid metrics that combine multiple metrics. For example, the energy-precision ratio (EPR) [247], provides a way to combine the classification error with the energy consumed per sample. In equation 8, energy-precision ratio (EPR) is defined as:\begin{equation*} \text {EPR} = \text {Error}^{\alpha} \times \text {EPI}, \tag{8}\end{equation*} View Source where Error is the classification error, $\alpha $ is the adjustment parameter, and EPI is the energy consumption per sample.

J. Robustness

Adversarial examples can manipulate DL models, and negatively affect the models’ performance or lead to misclassification. Thus the models need to be either robust to these examples by default or integrate various defence techniques to strengthen their robustness properties. The robustness of a model is defined as the insensitivity of the model to small perturbations made to any plausible input. Moreover, the robustness can be defined as the reciprocal of the KL Divergence:\begin{equation*} \psi (x) = \frac {1}{\max \limits _{\delta \in set} D_{\text {KL}} (\hat {y}, \hat {y'})},\tag{9}\end{equation*} View Source where $D_{\text {KL}}$ is the KL Divergence, $\hat {y}$ and $\hat {y'}$ are the predictions for a sample $x$ and $x+\delta $ , respectively [248]. Another simple way to measure the robustness can be the difference in the accuracy with and without the adversarial examples. The defence techniques for robustness include PixelDP, which is a certified defence for norm-bounded adversarial samples [249], adversarial training and ensemble learning.

K. Transferability and Lifelong Learning

The ability to reuse previously learned information for a related task indicates the transferability of a model. Thus, there is no need to train the model from scratch for a new task if the model is transferable from another related domain [250]. The transferability can be measured by using transfer accuracy and LEEP score ($T(\theta, D)$ ) [251].\begin{equation*} T(\theta, D) = {\frac {1}{n}} \sum _{i=1}^{n}\left({\sum _{z\in Z} \hat {P}(y_{i}|z), \theta (x_{i})_{z}}\right),\tag{10}\end{equation*} View Source where $x_{i}$ is a sample, $\theta $ is the source model, $D$ is the target dataset, $z \in Z$ is the label in the label set $Z$ of the source task, and $\hat {P}(y_{i}|z)$ is the conditional distribution of the predicted labels given an original label.

The environment can change over time, and the model needs to adjust accordingly to capture the changes. Thus, models capable of lifelong learning are preferable. In lifelong learning, the model retains the previously gained knowledge and also keeps learning new information with time. Overall, transferability and lifelong learning capability make the DL models data and computation efficient.

SECTION V.

Open Challenges and Future Direction

Thus far, we have discussed DL architectures, technologies, adaption techniques, and the key performance indicators required to facilitate DL to all in-edge. In this section, we now articulate the key open challenges and future research directions in the area of DL at all in-edge.

A. Challenges with Resource-Constrained Edge Servers

It’s necessary to know the configuration of ESs before starting the training and deployment of the DL model at ESs. This Section discusses challenges that arise from heterogeneous ESs provided by different edge infrastructure providers (e.g., Motorola Solutions, Hikvision, ADT) and associated future directions of research.

1) Memory Efficiency

There are significant challenges to facilitating both the training and inference of DL models on ESs due to the limited resources and heterogeneous configuration of different ESs. DL models can vary significantly in their overall size. For example, inception-v3 has a size of 91 MB [252], while vgg-19 has a size of 548 MB [253]). Thus, based on the selected DL model (assuming it to be vgg-19) and enabling technology (assuming it to be federated learning), it can become impossible for some ES to participate in DL model training due to insufficient memory (if memory is less than 548 MB). The lack of availability of certain ESs can negatively impact the DL model convergence rate (a small number of available ESs for distributed training will mean a slower convergence rate). Also, due to the fairly large size, some DL models will be limited to being deployed for inference at a small number of ESs. In the future, explore the direction of utilization of heterogeneous ESs by answering: How can we train a DL model at ESs where some models can train fairly large models due to extensive memory, and some can partially train those models? How can we design DL models to facilitate training across heterogeneous ESs?

2) Energy Requirement

As ESs in remote locations can be battery-powered, minimizing energy consumption is a critical ongoing challenge. One way to achieve it is by limiting the computation required in the training and inference phase, which inherently lowers the energy requirement. Another important avenue of research is to investigate the performance of battery-operated ESs when different DL models are trained and deployed. While chipset designers continuously strive to reduce the energy requirements of their products (GPUs, TPUs, etc.). The same understanding of the interaction of the rest of ES composition (computing chipset, storage drives, batteries, etc. are required) to find a fair trade-off between battery management and compute resources is required.

B. Quality of Service (QoS) Attributes For DL Model at an All In-Edge Level

In order to be competitive with a centralized cloud model, the all in-edge model needs to provide quality of service guarantees. This Section discusses the “DL model at all in-edge guarantees” to build a complete all in-edge framework for DL.

1) Low Latency

Low latency is the first attribute that needs to be fulfilled at an all in-edge level. Low latency can be achieved by providing faster communication during model training and a quicker inference response from a deployed DL model. Due to the closer proximity of a deployed DL model to the end users, reduced latency has been observed in edge-based models compared to the traditional cloud-based models. For real-world applications, DL applications like image segmentation, object detection, etc., require very low latency. Using edge-based DL models, academic and industrial researchers actively seek ways to reduce latency [254], [255], [256], [257], [258]. Although progress has been made in this area, the current state-of-art still results in significant latency, specifically when dealing with high dimensional input data (e.g., image, time series). For example, a constrained model architecture can process between 5 to 15 frames per second (fps) with an image resolution set to $1920 \times 1080$ [259]. However, processing 5 to 15 frames per second is relatively lower than the fps at which videos are captured (typically 24 fps higher). Processing a higher number of frames will result in delayed latency at the inference stage; This remains an open research problem and as such new techniques are required to deal with high dimensional data.

Similarly, model compression provides approaches to reduce latency by enabling larger DL models to be deployed at an ES. This reduces the computation required (as the DL model is quantized and compressed). However, DL networks have continued to grow in size (leading to a corresponding increase in the number of parameters). This necessitates further research on providing more powerful compression techniques for DL networks.

2) Heterogeneous Data Distribution and Asynchronous Edge Server Participation

The second attribute required by the DL model at the all in-edge level is its ability to be trainable at ESs with heterogeneous data distribution. Heterogeneous data distribution is caused by the non-Identical and Independent Distribution (non-IID) of the data among the multiple ESs, which leads to severe statistical heterogeneity challenges when training a DL model. For example, one extreme case is when an end server only has data from a particular class. Usually, DL algorithms trained in a distributed environment with multiple ESs with an overall non-IID data distribution will perform poorly [260]. This opens an interesting future direction of research. Adaptive optimization is one approach that can be used to improve the convergence speed of a DL model and can effectively mitigate the concerns of non-IID data distribution. For example, [261] proposed adapting FedAvg to use a distributed form of Adam optimization to implement adaptive federated learning, which converges to a target accuracy in $6 \times $ fewer rounds than compressed FedAvg. In the future, exploring momentum, adaptive optimization, learning rates, and other hyperparameters is a worthwhile research direction in the context of a non-IID distribution. In addition, the participation of the ESs can be inconsistent due to communication or computational reasons. This results in either a slowdown in the convergence rate or an inability to converge at all [262], thus placing more emphasis on asynchronous ES participation in the training phase. Future research into the robustness of training models in such scenarios is also warranted.

C. Privacy and Security Concerns

Despite the rapid development of privacy-preserving DL [263] and security mitigation techniques [264] in recent years, there are still open research challenges that need to be addressed. This Section discusses potential open research problems and future directions regarding privacy and security concerns impacting DL model development and deployment at all in-edge.

1) Privacy-Preservation

Providing adequate privacy preservation for DL applications is an area with open research challenges. To preserve the privacy of the client’s data at the ES, different enabling technologies are utilized with or without cryptographic techniques, perturbation techniques, and anonymization techniques [265], [266]. While these techniques provide a means of better safeguarding client data, they struggle to maintain the original level of model performance simultaneously. For example, the inclusion of these techniques can not only negatively impact the predictive performance of a model (accuracy, F1-score, etc) [103], [267], [268] but can also significantly lengthen the training [99], [269], and inference [270], [271]) time of a model. Therefore, there are opportunities in this area to preserve privacy while mitigating the negative consequences outlined above.

2) Security

The ESs need active participation while enabling DL at the all in-edge level. However, due to hardware constraints (e.g., low computational capability) and software heterogeneities of the ESs, this also represents an increase in the attack surface. Moreover, various attacks such as Distributed Denial-of-service(DDoS) targeting network/virtualization infrastructure, side-channel attacks targetting user data/privacy, malware injection targeting ES/devices, authentication and authorization attacks targeting ES/devices and virtualization infrastructure are possible for all in-edge computing system [272]. However, finding efficient and suitable countermeasures for these attacks is challenging due to constantly evolving attackers’ tactics, techniques, and procedures [273]. Besides, DL approaches such as federated learning and split learning for edge intelligence suffer from adversarial attacks on the federated models to modify their behavior and extract/reconstruct original data [274]. To that end, novel techniques are required to identify security attacks/breaches and mitigate such attacks in the future.

D. Framework and Architectural Changes to Facilitate DL Models at All In-Edge Level

The convergence of DL at the all in-edge level is a relatively new paradigm, with concerns about effective resource utilization, management, and interoperability amongst heterogeneous ESs, requiring new frameworks and architectural changes. This Section will discuss promising directions that can help mitigate the concerns at the convergence.

1) Microservices

As computing is getting pushed away from being cloud-based to edge-based, architectures to facilitate DL model training and deployment are also shifting from monolithic entities to graphs of loosely-coupled microservices [275]. Microservices provide a promising way of modularizing DL-based applications at the process level. For example, a single DL application can be decomposed into a non-overlapping atomic set of services in a microservice architecture. However, sometimes one DL model inference can depend on another DL model inference. At the same time, another DL model may require different computing languages (e.g., python, R), language dependencies (e.g., PyTorch, TensorFlow), and software dependencies (e.g., pycharm, GitLab). Microservices architectures provide a means for those DL models with different requirements to communicate effectively. Currently, the introduction of microservices for deploying and training at the edge is at a very early stage [276]. The research opportunity exists to build a robust microservice framework that can handle the deployment and management of the DL model. Another opportunity lies in migrating microservices-based DL applications from development to production with minimal downtime.

2) Management of DL-Based Applications at ESs

The confluence of DL models deployed at ESs and the emergence of smart cities has led to a new interesting research area of DL-assisted smart cities. With many DL models deployed at ESs in smart cities, it will become challenging to predict the future requirement of resources for DL computation accurately. Real-time optimization will be required amongst ESs to accommodate heterogeneous computation and communication adaptively. As a result, better resource orchestrators (online ES management applications) will be required at the edge to facilitate the potentially large number of requests that will be generated within an ecosystem of smart cities. Also, with every government taking steps toward smart cities, these orchestrations will be dispersed across different geolocations and regions, thus providing an opportunity for collaboration between individual orchestrators. A flexible coordination mechanism between orchestrators situated adjacent to each other will be required, which can also preserve citizens’ privacy. An emerging research direction is utilizing AI to tackle the design complexity of interconnected smart cities; One of the ways to achieve it is by using deep reinforcement learning (DRL) [277]. A distributed DRL-based scheme can provide an efficient way to solve the data-driven interference mitigation and resource allocation problem. It also opens up new research opportunities on the need to develop a uniform API interface for ubiquitous heterogeneous ESs to ease the deployment of orchestrators. Due to the highly dynamic nature of this environment (any ES can go offline and come back into service), an important and related research direction is the design of efficient service discovery protocols. Service discovery protocols will provide necessary information to companion ESs regarding what can be expected from DL-based applications deployed at that ES.

3) Designing Application Framework to Facilitate DL at the All In-Edge Level

All in-edge paradigm requires new ways of designing applications. In Section III-A, we presented different architectures capable of pushing AI to the ES with varying application requirements. With the enabling technologies explained in Section III-B) and model adaption techniques described in Section III-C, developing DL applications becomes progressively more complex. The aforementioned microservices-based architecture is another exciting area of research in the provisioning of DL-based applications at ESs [278]. Although other research provided the framework for designing DL-based applications by utilizing ESs, they all remain confined to the problem they tried to resolve. For example, in [279] provided a framework for the self-learning DL model, in which authors proposed a GAN-based synthesis of the traffic images. The proposed framework remains applicable only for video-based scenarios. Similarly, the work in [280] provides a framework that was restricted to work for web traffic anomaly detection. Likewise, other research [281], [282] has its niche, and the proposed framework is restricted to solving the specific problem type. To the author’s knowledge, Open EI [283] is the only framework that provides a generic approach to facilitate the development of applications for a wide range of problem domains (computer vision, natural language processing, etc.). Still, this framework lacks the components of hardware (choices in the selection of hardware accelerators that can help in faster DNN computation [284], [285], [286], [287]) and the deployment of the DL-based services (how to distribute load and develop a global model across the ES III-B). Therefore, there is a need to find a robust framework that can facilitate the easy development and deployment of complex DL-based applications at the all in-edge level by providing guarantees from DL-based applications (as mentioned in Section V-B) while adhering to infrastructural constraints of the ES resources (as discussed in Section V-A) alongside mitigating the privacy and security concerns (as described in Section V-C).

SECTION VI.

Conclusion

This paper reviewed the current states to facilitate the training and inference of DL models on a fine mesh of ESs (referred to as all in-edge level). The behavior of centralized, decentralized, and distributed architecture were discussed from the ES’s perspective to find a trade-off between simplicity (by centralized architecture) or achieving reliability (by utilizing a decentralized and distributed architecture) for DL models deployed at the all in-edge level. Technologies facilitating the DL training and deployment across ESs were described, which leverage the layer structure of the DL models and closer proximity to the origin of the data. Federated learning and split learning as enabling technologies were more effective than others as they provided enhanced privacy while training and providing inference from the DL model. Model adaption techniques were found to be necessary at all in-edge, providing benefits of minimizing the energy requirement, lowering communication message size, and decreasing the memory footprint. In addition to general performance indicators, this paper identified and put forward additional key performance indicators, measured in silos in several works but not considered to be evaluated simultaneously. Many research directions remain open regarding optimizing memory and energy of resource-constrained ESs for facilitating DL at ESs while preserving the privacy of the user’s data, incorporating advancements in cybersecurity to diminish security concerns, and lastly, close collaboration with networking technologies (such as network functions virtualization). With new technological innovations, shifts in DL-based application design, networking technologies improvements, and ESs hardware advances, many of the previously mentioned challenges will be mitigated. This will bring new challenges and opportunities for further innovation.

References is not available for this document.

Enabling All In-Edge Deep Learning: A Literature Review

Alerts

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction

Preliminary

A. Introduction to the Cloud and Edge Computing Paradigm

1) Cloud Computing

2) Edge Computing

B. All In-Edge Level of Edge Intelligence

Deep Learning at All In-Edge

A. Architecture

1) Centralized Architecture

2) Decentralized Architecture

3) Distributed Architecture

B. Enabling Technologies

1) Model Parallelism/DNN Splitting

2) Aggregated Frequency Control (AFC)

3) Gossip Training

4) Gradient Compression

5) Data Parallelism

6) Federated Learning

7) Split Learning

C. All In-Edge Model Adaption

1) Model Compression

a: Pruning

b: Quantization

c: Knowledge Distillation

d: Low-Rank Factorization

2) Conditional Computation

a: Early Exit

b: Model Selection

c: Result Cache

Key Performance Metrics of All In-Edge

A. Use-Case Specific Metrics

B. Training Loss

C. Convergence Rate

D. Latency

E. Communication Cost

F. Privacy Preserving

G. Energy Consumption

H. Memory Footprint/Model Size

I. Combined Metrics

J. Robustness

K. Transferability and Lifelong Learning

Open Challenges and Future Direction

A. Challenges with Resource-Constrained Edge Servers

1) Memory Efficiency

2) Energy Requirement

B. Quality of Service (QoS) Attributes For DL Model at an All In-Edge Level

1) Low Latency

2) Heterogeneous Data Distribution and Asynchronous Edge Server Participation

C. Privacy and Security Concerns

1) Privacy-Preservation

2) Security

D. Framework and Architectural Changes to Facilitate DL Models at All In-Edge Level

1) Microservices

2) Management of DL-Based Applications at ESs

3) Designing Application Framework to Facilitate DL at the All In-Edge Level

Conclusion

References