Introduction
Deep Neural Network (DNN) models are frequently used for solving difficult machine learning problems in various domains, including computer vision [12], natural language processing [13], and sensory data processing [14]. Provided enough training data is available, deeper and more complex neural architectures frequently lead to better performance [15]. For example, for the visual object classification problem, a deeper architecture (e.g., AlexNet [12]) almost always leads to better performance. However, processing inputs with deeper and more complex neural networks requires a significant amount of computing power and may increase the overall processing latency [16], [17]. These processing power requirements may prevent the deployment of deep learning applications on edge devices (e.g., smartphones or sensors) that have severe limitations in processing power or battery life [18], [19].
A distributed processing approach alleviates computing power limitations by employing multiple computing nodes to perform the necessary computations [20]. For instance, the DNN computation load can be partitioned between edge devices (e.g., smartphones) and more powerful server nodes. In this case, the edge devices perform the computations needed to extract features from the raw input (e.g., video), and the extracted features are then sent to a server where the rest of the computations are performed. In the case of a DNN, an edge device might compute a few lower layers of the network and transmit the output of an intermediate layer to the server for the rest of the processing [21]. Distributed computing might employ more than two types of computing nodes besides servers and edge devices. For instance, fog computing [22] architectures define a hierarchy of computing nodes according to their computing power and their role within the platform's network topology (e.g., gateways). Unfortunately, the distributed processing scheme might result in significant network traffic due to data transmissions among computing nodes. Depending on the depth of a neural network and its partition among computing nodes, the intermediate layers might produce tens of thousands of real values that need to be communicated between computing nodes [19].
In this work we propose an optimization scheme that extends the work of BranchyNet [2]. The BranchyNet concept suggests augmenting the original network with small intermediate decision networks attached to selected hidden layers within the DNN. These small networks (called branches) are trained to infer network outputs (e.g., classification labels) on "easy" input cases. The branch networks are trained jointly to make decisions solely based on the values produced by intermediate layers of the main DNN. When such an inference is possible, the inference through the rest of the DNN layers is interrupted, saving precious resources, including network bandwidth and computing capacity.
Our approach extends the original BranchyNet concept to support two practical requirements. We assume that the original training data might not be available at optimization and deployment time. Moreover, the parameters of the training procedure for the main network might not be known, or the training process might not be easily reproducible. We also assume that resource limitations and accuracy requirements are application dependent. Therefore, the augmented model should have a "knob" providing explicit and predictable control over the tradeoff between accuracy and computational cost.
Motivated by these assumptions, we attach branches to an already trained network, addressing cases where no complete copy of the original training data is available while optimizing the network performance. The network is augmented by branches that contain two "heads": a classification head and a decision head. The classification head is trained to mimic the output of the original network, and the decision head estimates the reliability, or certainty, of the match between the outputs of the branch and the original network. Depending on a threshold, the combined model continues to evaluate higher layers of the main network if the certainty is too low. We evaluate the effectiveness of our approach on two standard benchmarks: SVHN [23] and CIFAR10 [24].
Prior Work
Several strategies for neural network optimization have been explored in the past. The existing approaches can be partitioned into two categories: network complexity reduction [25] and distributed computing [22]. In this section we cover common methods in both categories. It should be noted that we focus on techniques for run-time performance optimization during inference, as opposed to run-time optimization of the training procedure. The optimization of DNN training time is also frequently discussed in the context of distributed neural networks [25].
Some of the common approaches for reducing the computation and communication requirements of DNN inference are based on weight quantization. Weight quantization methods reduce the number of bits required for storing network weights; reducing the number of bits results in simpler and faster computations [26]. More advanced quantization approaches are based on weight clustering: a network weight is approximated by the center of its closest cluster and can therefore be encoded with a smaller number of bits.
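To make the clustering idea concrete, the following minimal sketch quantizes a weight matrix with k-means; it is an illustration only, and the specific codebook sizes and clustering procedures used in the works cited above may differ:

```python
# Minimal sketch of cluster-based weight quantization (illustrative only;
# the specific schemes referenced above may differ in detail).
import numpy as np
from sklearn.cluster import KMeans

def cluster_quantize(weights: np.ndarray, bits: int = 4):
    """Approximate each weight by the centroid of its cluster.
    With 2**bits clusters, each weight is stored as a `bits`-bit index
    plus a small shared codebook of centroid values."""
    flat = weights.reshape(-1, 1)
    kmeans = KMeans(n_clusters=2 ** bits, n_init=10, random_state=0).fit(flat)
    codebook = kmeans.cluster_centers_.ravel()      # 2**bits real values
    indices = kmeans.labels_.astype(np.uint8)       # bits-bit code per weight
    dequantized = codebook[indices].reshape(weights.shape)
    return indices, codebook, dequantized

# Example: quantize a random 64x64 weight matrix to 4-bit codes.
w = np.random.randn(64, 64).astype(np.float32)
idx, codebook, w_q = cluster_quantize(w, bits=4)
print("mean absolute quantization error:", np.abs(w - w_q).mean())
```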
Network complexity reduction can also be achieved using network pruning [27] and connection sharing [28]. Network pruning techniques reduce the network complexity by dropping less important neurons and connections. There are different methods for selecting optimal pruning strategies [25]. For example, one method suggests dropping all connections of a fully connected layer whose weights are lower than a predefined threshold [29]. This technique essentially converts a dense layer to a sparse layer and reduces the storage and computation requirements by an order of magnitude.
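A minimal sketch of this magnitude-based pruning idea is shown below; the threshold selection and storage format in [29] may differ, so this is illustrative rather than a reproduction of that method:

```python
# Sketch of magnitude-based pruning of a fully connected layer.
import torch

def prune_linear_weights(weight: torch.Tensor, threshold: float) -> torch.Tensor:
    """Zero out all connections whose absolute weight is below `threshold`
    and return a sparse (COO) tensor, reducing storage for the layer."""
    mask = weight.abs() >= threshold
    pruned = weight * mask
    return pruned.to_sparse()

w = torch.randn(1024, 1024)
w_sparse = prune_linear_weights(w, threshold=1.5)
kept = w_sparse.values().numel() / w.numel()
print(f"fraction of connections kept: {kept:.3f}")
```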
Connection sharing methods reduce complexity by sharing parameters among multiple neurons [28]. For example, in CNN models the main assumption is that a filter that is useful for computation over one region of the data in a given layer can also be useful over a different region of the same layer's input [12].
Whenever a deep learning network is deployed in a distributed environment, efficient partitioning of the DNN between nodes and different communication schemes might reduce the communication and processing load. For instance, [21] suggests an efficient way of mapping sections of a DNN onto a distributed computing hierarchy. In [31], the authors suggest deploying shallow networks to edge devices to perform a "gating function": the output of these shallow (auxiliary) networks is used to decide whether an input has to be transferred to the stronger backbone servers for inference using deeper and more complex neural networks.
The original early-exit augmentation approach is based on the BranchyNet ideas [2]. The BranchyNet concept suggests augmenting the main deep neural network with additional side-branch classifiers attached to selected intermediate layers. The augmented network has a single entry point and multiple exit points, and its decision can be produced at any exit point. If a side-branch classifier indicates a sufficient degree of confidence, all further processing is stopped and the decision is made solely on the basis of the branch classifier output. Another early-exit approach is introduced in [1], where a confidence head is added in parallel to the classifier head; the sigmoid output of the confidence head is trained to produce a confidence level that is used at inference time as the early-exit decision mechanism.
A few other early-exit architectures and design implementations have been introduced in the field of network cost optimization. Reference [11] presents a Learning Early Exit (LEE) scheme with an online algorithm that chooses the exit point of a DNN by performing history exploration using a reward formulation. Among hardware-aware approaches, [10] defines the HAPI framework, which formulates the hardware-aware design of early-exit CNNs as a mathematical optimization problem and generates progressive inference networks customized for the specific target deployment platform. Furthermore, [9] presents a neural network designed for miniature edge devices, which allows a distributed implementation (across both flash memory and on-chip SRAM) of a small two-exit network. Other early-exit methodologies were introduced in [33], [34]. Reference [33] introduces early-exit opportunities on a reference model targeting a specific class, which improves the average classification rate for that class while maintaining the original model accuracy. Reference [34] presents an early-exit mechanism based on class means: the means are obtained by averaging the layer output of each class at every layer of the model, and during inference the output of a layer is compared with the corresponding class mean to stop execution.
Our Contribution
We extend the original BranchyNet model by introducing several critical enhancements:
We define a new early-exit branch architecture that includes both a classification output and a confidence head for early-termination decision making. These branches are attached to a pre-trained backbone network to form a PTEEnet (Post-Trained Early Exit network).
Branch placement is done using suitable distribution methods to allow computational cost optimization while considering the architectural constraints of the original network.
We train only the attached branches, using a loss function that combines the accumulated prediction loss and a weighted computational cost associated with the consumed computational resources. The loss function is designed to express the certainty of the branch output while considering the cost of proceeding to deeper layers.
For branch training and accuracy evaluation, we use the output of the main pre-trained network as a label generator, allowing the use of unlabeled datasets. This supports practical scenarios where the dataset used to train the original network is not available.
PTEE Methodology
Our PTEE (Post-Trained Early Exit) model can be cast onto many deep neural network architectures, with a branch distribution suited to each architecture. We first describe the global PTEEnet network architecture (which builds on the earlier EEnet work [1]) and then expand it to define the new methodology.
A. EE-Blocks Distribution
The number of early-exit branches and their placement are important factors in the model architecture. As introduced in [1], several distribution methods are suitable. The Pareto method is based on the 80/20 principle, where 80% of the samples are classified using 20% of the total computational cost; it therefore requires a ratio of 0.2 between the computational cost added by each branch and the total cost accumulated up to the previous branch. In the same manner, the Fine and Golden distribution methods require ratios of at least 0.05 and 0.618, respectively, depending on the internal design of the network. The Linear distribution method defines a fixed computational cost gap between consecutive branches. To reduce the degrees of freedom in our exploration we use the Fine distribution method.
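For illustration, the sketch below derives attachment points from per-block FLOP counts under a minimum cost-ratio rule; the block-level cost bookkeeping and the exact Fine-distribution rule used in [1] are assumptions, not a reproduction of that procedure:

```python
# Illustrative sketch of cost-driven branch placement. We assume a list of
# per-basic-block FLOP counts for the backbone.
from typing import List

def place_branches(block_flops: List[float], min_ratio: float = 0.05) -> List[int]:
    """Return indices of basic blocks after which a branch is attached.
    A branch is placed once the cost accumulated since the previous
    attachment reaches `min_ratio` of the total backbone cost."""
    total = sum(block_flops)
    attach_points, accumulated = [], 0.0
    for i, flops in enumerate(block_flops[:-1]):   # never attach after the last block
        accumulated += flops
        if accumulated >= min_ratio * total:
            attach_points.append(i)
            accumulated = 0.0
    return attach_points

# Example: a toy backbone with 9 basic blocks of increasing cost.
blocks = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0]
print(place_branches(blocks, min_ratio=0.2))   # -> [3, 6]
```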
Figure 1 shows 3 branches distributed over a ResNet20 main network using the Fine distribution method. Choosing the optimal number of branches depends mainly on the main network size, the distribution method, and dataset characteristics.
Figure 1. ResNet20 model with 3 attached early-exit branches. Each branch allows early termination of sample propagation by applying a threshold on its confidence head.
We follow the computational cost levels defined by the distribution method to place the branches along the main network. The original network is organized into stages, delimited by the placement of the attached branches. These are only a semantic structure that helps locate branches and define network segments. Each branch, together with the stages preceding it, is grouped into a segment, which is used for further complexity analysis. For simplicity, branches are attached only between logical basic blocks, which in the case of the ResNet architecture consist of a single residual block.
B. Cumulative Predictions and Computational Cost
We begin the definition of our loss elements by using the cross-entropy function as the classification loss of each exit head:\begin{equation*} \mathcal{L}_{MC}=CE\left(y,\hat{Y}\right)=-\sum_{k=1}^{K} y_{k}\cdot \log\left(\hat{Y}_{k}\right),\end{equation*} where $y$ is the label vector over $K$ classes and $\hat{Y}$ is the cumulative prediction of the augmented network. With a hard confidence threshold $T$ applied to the confidence outputs $h_{n}$ of the $N$ branches, the cumulative prediction is\begin{align*} \hat{Y}=&\,I_{(h_{0}\ge T)}\cdot \hat{y}_{0}+I_{(h_{0}< T)}\\ &\cdot\Big(\ldots I_{(h_{N-1}\ge T)}\cdot \hat{y}_{N-1}+I_{(h_{N-1}< T)}\cdot \hat{y}_{N}\ldots\Big),\tag{1}\end{align*} where $\hat{y}_{n}$ is the prediction of the $n$-th exit ($\hat{y}_{N}$ being the prediction of the main network) and $I_{(\cdot)}$ is the indicator function. Since the indicator function is not differentiable, during training we replace it by the soft confidence values and define the cumulative prediction recursively:\begin{equation*} \hat{Y}_{n}=h_{n}\cdot \hat{y}_{n}+\left(1-h_{n}\right)\cdot \hat{Y}_{n+1},\quad n=0,\ldots,N-1.\tag{2}\end{equation*} The cumulative computational cost is defined analogously, where $c_{n}$ denotes the relative cost of the $n$-th exit segment:\begin{equation*} C_{n}=h_{n}\cdot c_{n}+\left(1-h_{n}\right)\cdot C_{n+1},\quad n=0,\ldots,N-1.\tag{3}\end{equation*} The total training loss combines the classification loss and the cost term of every exit, weighted by the penalty parameter $\lambda$:\begin{align*} \mathcal{L}=\sum_{n=0}^{N-1}\left(\mathcal{L}_{MC}^{(n)}+\lambda\,\mathcal{L}_{Cost}^{(n)}\right)=\sum_{n=0}^{N-1}\left(CE\left(y,\hat{Y}_{n}\right)+\lambda\, C_{n}\right).\tag{4}\end{align*}
C. Training
As a preliminary stage before training the branches, we freeze the pre-trained weights of the backbone network and disable their gradient computation. To eliminate the dependency on data labels, we use the main network outputs as the ground truth values (replacing the original labeled data): we execute a forward pass for each batch of training data and use the main network classifier predictions as the corresponding labels. These labels are used both for training and for validation. The loss is back-propagated only up to the stitching point of each branch, and only the branch weights are updated.
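A simplified sketch of this training loop is shown below. The model interface (a forward pass returning the main-exit logits together with per-branch logits and confidences), the optimizer settings, and the `pteenet_loss` helper sketched earlier are illustrative assumptions rather than the exact implementation:

```python
# Sketch of branch training with a frozen backbone and pseudo-labels taken
# from the main network output.
import torch

def train_branches(model, loader, branch_params, costs, lam, epochs=10, device="cuda"):
    # Freeze the pre-trained backbone so that only branch parameters are updated.
    for p in model.parameters():
        p.requires_grad = False
    for p in branch_params:
        p.requires_grad = True
    optimizer = torch.optim.Adam(branch_params, lr=1e-3)

    model.to(device)
    for _ in range(epochs):
        for x, _ in loader:                              # the original labels are ignored
            x = x.to(device)
            main_logits, branch_logits, branch_conf = model(x)
            pseudo_labels = main_logits.argmax(dim=1)    # backbone predictions as ground truth
            loss = pteenet_loss(branch_logits + [main_logits],
                                branch_conf, costs, pseudo_labels, lam)
            optimizer.zero_grad()
            loss.backward()       # gradients flow only up to each branch stitching point
            optimizer.step()
```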
D. Inference
During inference we apply a stop criterion in the form of a dedicated confidence threshold level $T$: propagation through the network stops at the first exit $n$ whose confidence output satisfies $h_{n}\ge T$, where $h_{n}$ is the sigmoid output of the confidence head of branch $n$. If no branch reaches the threshold, the sample is classified by the main network exit. The procedure is summarized in Algorithm 1.

Algorithm 1 PTEEnet Fast Inference Procedure
While there is a remaining exit branch $n=0,\ldots,N-1$: compute the next backbone stage and the branch outputs $(\hat{y}_{n},h_{n})$
If $h_{n}\ge T$
Return $\hat{y}_{n}$
End if
End While
Compute the remaining backbone layers and Return the main network prediction $\hat{y}_{N}$
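For completeness, a minimal Python sketch of this inference procedure is given below; the `segments`, `branches`, `final_segment`, and `classifier` modules are assumed names describing how the backbone could be split around the attachment points:

```python
# Sketch of early-exit inference for a single input (batch size 1).
# `segments[n]` computes the backbone stages preceding branch n, and
# `branches[n]` returns (class_logits, confidence) for that exit.
import torch

@torch.no_grad()
def pteenet_infer(x, segments, branches, final_segment, classifier, T=0.5):
    for segment, branch in zip(segments, branches):
        x = segment(x)                      # backbone stages preceding this exit
        logits, h = branch(x)
        if h.item() >= T:                   # confident enough: stop here
            return logits.argmax(dim=1)
    x = final_segment(x)                    # no branch was confident: finish the backbone
    return classifier(x).argmax(dim=1)
```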
E. Branch Architecture
Using a shallow-capacity branch architecture, as in [1], is viable only for small backbone networks. To account for deeper architectures, typically ResNet20 and beyond, we need branches with higher learning capacity. Exploring different branch architectures yielded ConvX, presented in Figure 2, which consists of sequential convolutional blocks attached to the classifier and confidence heads through an average pooling layer:
Each block consists of a convolutional layer using a
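The exact ConvX configuration (kernel size, number of blocks, channel widths) is not fully specified in the excerpt above; the sketch below is therefore an illustration that assumes 3x3 convolutions with batch normalization and ReLU, global average pooling, and linear classifier and confidence heads:

```python
# Illustrative ConvX-style branch: sequential conv blocks feeding a classifier
# head and a sigmoid confidence head through average pooling. Kernel size,
# number of blocks, and channel widths are assumptions.
import torch
import torch.nn as nn

class ConvXBranch(nn.Module):
    def __init__(self, in_channels, num_classes, width=64, num_blocks=2):
        super().__init__()
        blocks, c = [], in_channels
        for _ in range(num_blocks):
            blocks += [nn.Conv2d(c, width, kernel_size=3, padding=1, bias=False),
                       nn.BatchNorm2d(width),
                       nn.ReLU(inplace=True)]
            c = width
        self.blocks = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(width, num_classes)   # classification head
        self.confidence = nn.Linear(width, 1)             # confidence head

    def forward(self, x):
        z = self.pool(self.blocks(x)).flatten(1)
        return self.classifier(z), torch.sigmoid(self.confidence(z)).squeeze(1)
```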
F. Complexity
To measure the inference computational complexity of any network segment in our model, we use FLOPs (floating point operations) as a common measurement unit, also used in the original ResNet paper [32]. Each FLOP here denotes one multiply-accumulate (MAC) pair, i.e., one multiplication and one addition; on modern hardware such operations are often executed as fused multiply-add (FMA) instructions. An exit network segment includes all stages (see Figure 1) preceding the branch attach point and the branch itself. To calculate the number of FLOPs in each exit segment, we perform a forward pass and accumulate the total number of FLOPs over all segment layers. For example, the number of FLOPs in a 2D convolution layer with unit stride and padding, a square kernel of size $k$, $C_{in}$ input channels, $C_{out}$ output channels, and an output feature map of size $H\times W$ is $k^{2}\cdot C_{in}\cdot C_{out}\cdot H\cdot W$.
The main exit in each model corresponds to the exit of the backbone network itself and is assigned a relative cost of 1.
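A small helper matching the MAC-counting convention above is shown below; summing such terms over all layers of a segment gives its total count (the layer bookkeeping itself is left out for brevity):

```python
# MAC/FLOP count for a 2D convolution under the convention above
# (one FLOP = one multiply-accumulate pair).
def conv2d_flops(k, c_in, c_out, h_out, w_out):
    """k: square kernel size, c_in/c_out: channel counts,
    h_out/w_out: output feature map size."""
    return k * k * c_in * c_out * h_out * w_out

# Example: the first 3x3 convolution of ResNet20 on 32x32 CIFAR inputs.
print(conv2d_flops(k=3, c_in=3, c_out=16, h_out=32, w_out=32))  # 442368 MACs
```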
Experiments and Results
To evaluate our PTEEnet methodology, we use three state-of-the-art vanilla backbone architectures, ResNet, VGG, and DenseNet, to create their corresponding PTEEnet variants: ResEEnet, VGGEEnet, and DenseEEnet. We evaluate these variants on the CIFAR10 and SVHN datasets. For the ResEEnet architecture we use pre-trained ResNet20, ResNet32, and ResNet110 backbones and attach 3, 5, and 10 branches, respectively, resulting in the ResEEnet20/3, ResEEnet32/5, and ResEEnet110/10 models. For the VGG and DenseNet architectures we use VGG19 and DenseNet121 backbones, each attached with 3 branches, to construct VGGEEnet19/3 and DenseEEnet121/3. For each model we "freeze" the pre-trained weights and train only the branches. We train the branches with a range of values of the cost penalty parameter $\lambda$ defined in (4).
Figure 3 presents accuracy and computational cost for different values of the penalty parameter $\lambda$.
Figure 3. Computational cost reduction and validation accuracy pairs generated from increasing levels of $\lambda$.
The blue curve in Figure 3 shows the accuracy and computational cost reduction behaviour of ResEEnet110/10 on the CIFAR10 dataset; each point corresponds to a model trained with a different value of $\lambda$.
Setting a parameter value
As discussed in the Inference subsection, during inference the decision about a sample's class is made based on the confidence threshold level $T$.
Since $T$ is applied only at inference time, it can be varied without retraining the branches, trading accuracy against computational cost.
Computational cost reduction and validation accuracy pairs generated from varied levels of the confidence threshold $T$.
The final optimal confidence level threshold selection depends on the application requirements in terms of the amount of accuracy decrease allowed and computational resources at hand.
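A sketch of how such a threshold could be chosen under an accuracy-drop budget is given below; the sweep granularity and the `evaluate` helper (returning validation accuracy and relative cost for a given threshold) are assumptions for illustration:

```python
# Sketch of selecting the confidence threshold T under an accuracy-drop budget.
import numpy as np

def select_threshold(model, loader, evaluate, baseline_acc, max_drop=0.03):
    """Return the threshold giving the largest cost reduction whose validation
    accuracy stays within `max_drop` of the baseline accuracy."""
    best_T, best_cost = 1.0, 1.0          # T = 1.0 effectively disables early exits
    for T in np.linspace(0.0, 1.0, 21):   # sweep candidate thresholds
        acc, cost = evaluate(model, loader, T)
        if acc >= baseline_acc - max_drop and cost < best_cost:
            best_T, best_cost = T, cost
    return best_T, best_cost
```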
Table 2 presents a performance comparison of various PTEEnet models based on a maximum tolerance of a 3% drop in validation accuracy. For each model we set a confidence threshold $T$ that meets this tolerance.
Summary
In this work we proposed PTEEnet, a methodology for attaching and training early-exit branches on pre-trained state-of-the-art deep neural networks. It has been shown that the output produced by the original network can successfully be used as labels for training the exit classifier and "confidence" heads, removing the need for the original labeled training data. Furthermore, we used a single confidence threshold parameter to control the accuracy versus cost tradeoff, allowing easy selection of an optimal operating point based on specific application requirements and constraints. Using several examples, we showed that a significant reduction in average computational cost can be achieved by selecting optimal confidence thresholds while incurring only a small impact on the overall accuracy.
Although the applicability of the approach is not limited to any specific task, the current work demonstrates the benefits of the method for image classification using several popular main-network architectures. The PTEEnet approach can be used alongside other neural network optimization techniques, such as pruning and network compression methods, that are usually applied to the main network.
Future work can explore more complex training methods. For instance, the branch heads can be trained incrementally, with different fine-tuned confidence thresholds for each exit. The threshold confidence level