Introduction
CNNs (convolutional neural networks) [1] were first presented in 1989, and they have demonstrated excellent performance in many visual tasks such as semantic segmentation [2], [3], image classification [4], and object detection [5], [6]. In particular, as hardware has developed, the performance of CNNs has increased dramatically due to the higher computational capacity of the hardware. However, most CNNs are still not as accurate as a human visual system. In recent years, many efforts have been made to improve the performance of CNNs by regularizing parameters [7], exploiting superior loss functions [8], modifying pooling operations [9], and designing more meaningful network architectures [10].
Some classic models have validated that the depth of a CNN is pivotal to its performance [11], [4]. Furthermore, many visual recognition tasks have benefited from very deep networks [12], [13]. In many cases, a considerably deeper network indeed achieves better results than a shallower one, and we can easily obtain a higher-quality model by increasing the depth. However, a very deep CNN has several disadvantages. First, a deeper network suffers from vanishing or exploding gradients [14], which can prevent training from converging. In addition, a deeper network contains more parameters. In particular, in the late stages of a network, the number of parameters grows rapidly as the number of convolution channels increases, which adds considerable computational cost. Moreover, more stacked layers hamper the performance of a network due to overfitting: the model attains a lower training error but a higher testing error. Additionally, as the network depth increases, the accuracy becomes saturated and then degrades rapidly [12].
To solve the problems caused by depth, [11] introduced an "inception module" to increase the width of a CNN, based on the idea that visual information should be abstracted at various scales to obtain multiscale features. The "inception module" is computationally less expensive because it replaces a larger-scale convolution in a layer with a set of small-scale convolutions. In a wide dense block [13], the features of all preceding layers are used as inputs to all subsequent layers. In some sense, increasing the width can be regarded as feature fusion, since the preceding and posterior features of certain blocks are fused in stages. In a wide residual block [12], the inputs of a subsequent layer are drawn not only from its preceding layer but also from earlier layers through shortcut connections. Both types of blocks increase the width and fuse preceding features into subsequent features, which dramatically enhances the performance of their respective networks. In object detection, some high-performance architectures [40] harness high-resolution features at small scales, with larger-scale features extracted from shallow layers. By leveraging the better localization properties of low-level features, these architectures markedly improve localization accuracy and recall.
Based on the merits of increasing the width of CNNs, we propose a novel method that adds an auxiliary block to different blocks of a CNN to increase its width. The auxiliary block extracts low-level features at different scales, which are concatenated with the features extracted at different levels from the blocks of the CNN. Our method can thus be summarized as feature fusion between low-level features at different scales and features at different levels. Based on this form of feature fusion, we design two fusion strategies, low-level feature fusion (L-Fusion) and high-level feature fusion (H-Fusion), and low-level features at five scales, which are implemented with two CNNs (Net 1 and VGG-16-V). We assess the performance of these networks on CIFAR10 and CIFAR100. Finally, we validate our conclusions on DenseNet-BC [13] (depth = 40), ALL-CNN-C [15] (depth = 9), Darknet 19 [42] (depth = 19), Resnet 18 [12] (depth = 18) and Resnet 50 [12] (depth = 50).
The contributions of this paper can be summarized as follows.
1) L-Fusion, which concatenates the low-level features extracted from Auxi-Block with the low-level features extracted from BaseNet, can efficiently enhance the performance of networks, regardless of whether the features extracted from Auxi-Block have the same scale as those extracted from BaseNet.
2) In terms of feature fusion, the nonsquare convolutions $5\times 3\cup 3\times 5$ and $7\times 3\cup 3\times 7$ are not remarkably better than the square convolutions $3\times 3$, $5\times 5$ and $7\times 7$. Owing to the fewer parameters of the $5\times 5$ convolution compared with the $7\times 7$ convolution or a larger-size convolution, we select the $5\times 5$ convolution to extract features at different scales for L-Fusion. Using L-Fusion, we modify the architectures of DenseNet-BC (40), ALL-CNN-C (9), Darknet (19), Resnet (18) and Resnet (50) to verify our conclusions. The experimental results show that the five modified networks are more competitive.
3) For our method, channelwise concatenation performs better than elementwise summation. Furthermore, the Concatenation + CR operations are more applicable.
Related Work
Currently, CNNs have a typical structure: stacked convolutional layers are followed by one or more fully connected layers. More precisely, before passing through a convolutional layer, the input undergoes two consecutive operations: batch normalization [7] followed by a rectified linear unit (ReLU) activation [16]. To reduce the dimensionality and mitigate overfitting in a large network, downsampling is employed after an activation layer. Because of this typical structure, in conjunction with the principle of backward propagation, there are three principal research directions for enhancing the performance of deep CNNs (a minimal code sketch of this typical layer ordering is given after the following list):
1) optimizing the convolution operation and pooling operation, 2) modifying the activation function and loss function, and 3) increasing the depth and width.
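For reference, the following is a minimal PyTorch sketch of the typical ordering described above (batch normalization, then ReLU, then convolution, with optional downsampling). The channel numbers are placeholders rather than values taken from any network in this paper.

```python
import torch
import torch.nn as nn

class BNReLUConv(nn.Module):
    """One typical CNN unit: BatchNorm -> ReLU -> Conv2d, optionally followed by downsampling."""
    def __init__(self, in_ch, out_ch, kernel_size=3, downsample=False):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        # Downsampling (here max pooling) halves the spatial resolution.
        self.pool = nn.MaxPool2d(2) if downsample else nn.Identity()

    def forward(self, x):
        return self.pool(self.conv(self.relu(self.bn(x))))

# Example: a 32x32 RGB input passed through two units, the second with downsampling.
x = torch.randn(1, 3, 32, 32)
y = BNReLUConv(3, 16)(x)                     # -> (1, 16, 32, 32)
z = BNReLUConv(16, 32, downsample=True)(y)   # -> (1, 32, 16, 16)
```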
A. Optimizing the Pooling Operation and Convolution Operation
In CNNs, convolution is the most basic operation, and a pooling layer always follows a certain convolutional layer. These layers are very important to extract semantic features and reduce the parameters of the model. Based on the sparsity of the activations in a pooling region, [9] proposed a sparsity-based stochastic pooling mechanism that integrated the advantages of max pooling, average pooling, and stochastic pooling for CNNs to improve the recognition accuracy. According to the observation that a ranking list was invariant when the activation operation changed the values of a pooling region, [17] compared three new pooling methods, including rank-based average pooling, rank-based weighted pooling and rank-based stochastic pooling, with the conventional pooling operation on four benchmark image datasets, and the results showed that rank-based pooling outperformed the existing pooling methods in classification performance. Due to the existence of redundancy in a standard convolution, which can be divided into spatial and channel domains, [18] achieved superior performance by relaxing the sparsity of the convolution in the spatial domains and reducing the redundancy in the channel domains. Replacing the conventional convolutional layer with a multilayer perceptron, [19] designed a novel deep network by stacking multiple perceptrons and a global average pooling layer. Reference [20] proposed replacing the dense shallow multilayer perceptron with a sparse shallow multilayer perceptron to decrease the number of parameters and extract local features in channel domains and in spatial domains. To make full use of the spatial features and temporal features for video action recognition, [44] designed trajectory pooling and line pooling to fuse spatial and temporal information.
B. Modifying the Activation Function and Loss Function
The activation function is crucial for a neural network since it determines whether a neuron works, and the loss between the output value of the network and the label value is very important for backpropagation. Many research efforts have been made in these two respects to improve network performance. Reference [21] improved the fine-grained image classification accuracy by applying a generalized large-margin loss to AlexNet, GoogLeNet and VGG. Reference [22] proposed a deep CNN with an associated objective function that consisted of a max-margin objective, a max-correlation objective and a correntropy loss for a minimum score of positive labels, a latent semantic space and a minimal training loss. To eliminate the requirement of carefully tuning the learning rate to prevent exploding gradients, [23] designed a multitask loss function to conduct joint training for classification and a bounding box regression. Reference [24] designed a fast exponentially linear unit with a rectified linear unit (ReLU) and an exponential linear unit (ELU). It used the ELU on the negative part and the ReLU on the positive part to accelerate the calculation speed and improve the robustness of the network. Reference [25] employed different activation functions, including the rectified linear unit (ReLU), leaky ReLU (LReLU), parametric ReLU (PReLU) and exponential linear unit (ELU), in different convolutional layers to extract better information.
C. Increasing the Depth and Width
In 2012, researchers began to study the importance of the depth (the number of layers) of CNNs. By using an ablation study and occlusion experiment, [26] demonstrated that the depth of networks, rather than any individual section, was vital to their performance. Reference [27] increased the depth by using a set of subnetworks to extract complementary features and the same number of classifiers based on random projections.
The width (the number of units of each layer) is as important as the depth for the performance of CNNs. Reference [10] introduced a novel network form by stacking a series of “inception modules”, which consisted of a group of different scale convolutions in one layer. This approach increased both the depth and the width of the network, which improved the classification performance. Subsequently, an increasing number of researchers have studied how the width of a network influences network performance, and the “inception module” has been continuously upgraded. By factorizing convolutions of the original “inception module” [10] and adding batch normalization to diminish the computational cost, [52] proposed two modified architectures: Inception v2 and Inception v3. Reference [53] constructed the Inception v4 architecture by simplifying the architecture of Inception v3 and adding more “inception modules”.
Reference [12] added a skip connection that bypasses the nonlinear transformation with an identity shortcut, i.e., $x_{l+1}=x_{l}+\mathcal {F}\left ({x_{l},W_{l}}\right)$, where $\mathcal {F}$ denotes the residual mapping.
Inspired by the human retina mechanism, [28] suggested a coupled convolutional layer that consisted of a set of mutually constrained convolutions. Reference [29] used a fully convolutional two-stream fusion network to extract deep features from input images and user interactions individually, which achieved better image segmentation performance. Instead of using large-scale convolutions to construct a deep CNN directly, [30] proposed a series of cascaded subpatch convolutions that included a small-scale convolution and a
Proposed Method
To decrease the number of network parameters, most current CNN designs implement a downsampling operation (such as max pooling or average pooling) after some convolutional layers to obtain downsampled features. We divide a CNN into different blocks on the basis of downsampling. If we apply downsampling at an early or a late stage of the network, we obtain shallow (low-level) or deep (high-level) features, respectively. As shown in Fig. 1, BaseNet consists of 4 blocks. We treat the features within a block, which is a series of consecutive stacked convolutional layers without downsampling, as having the same size, and we regard features as having the same scale if they are extracted by convolutions of the same scale. The features extracted by Block 1 or Block 2 are shallow (low-level) features, and the features extracted by Block 3 or Block 4 are deep (high-level) features.
We build BaseNet and AuxiNet for multiscale feature fusion. BaseNet consists of 4 blocks, and each block has many stacked convolution layers without downsampling. Namely, the features generated by every convolution layer in a block have the same size. AuxiNet consists of only one block, which is exploited to extract the low-level features at different scales. We define two feature fusion strategies, L-Fusion and H-Fusion, which fuse the low-level features extracted from AuxiNet to the low-level features and the high-level features extracted from BaseNet, respectively.
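To make the layout concrete, the following is a minimal PyTorch sketch of this arrangement under simplified assumptions: each block is reduced to a single convolution, the channel numbers are placeholders, and only L-Fusion 1 (fusing the AuxiNet output with the Block 1 output) is shown. The actual Net 1 and VGG-16-V configurations are given in Fig. 2 and Fig. 3.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel_size):
    """A stand-in for one block: Conv -> BN -> ReLU (real blocks stack several such layers)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class LFusion1Sketch(nn.Module):
    """BaseNet (4 blocks separated by downsampling) plus an Auxi-Block whose
    different-scale low-level features are concatenated with the Block 1 output."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = conv_block(3, 32, kernel_size=3)       # low-level, 3x3 scale
        self.auxi_block = conv_block(3, 32, kernel_size=5)   # low-level, 5x5 scale
        self.block2 = conv_block(64, 64, kernel_size=3)      # 64 = 32 + 32 fused channels
        self.block3 = conv_block(64, 128, kernel_size=3)
        self.block4 = conv_block(128, 256, kernel_size=3)
        self.pool = nn.MaxPool2d(2)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        fused = torch.cat([self.block1(x), self.auxi_block(x)], dim=1)  # L-Fusion 1
        out = self.pool(fused)
        out = self.pool(self.block2(out))
        out = self.pool(self.block3(out))
        out = self.block4(out)
        out = out.mean(dim=(2, 3))  # global average pooling
        return self.classifier(out)

logits = LFusion1Sketch()(torch.randn(2, 3, 32, 32))  # -> (2, 10)
```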
A. Different Feature Fusion Strategies
To some extent, increasing the width of a network can be regarded as feature fusion because the features extracted by different convolutions in one layer, or by multiple parallel subnetworks, are eventually fused. Features at different levels are important for classification tasks. In every block of a network, the features produced by preceding convolutional layers can be fused with the features extracted from subsequent layers to improve the robustness of the network, which has been verified by [12] and [13]. To study the influence of fusing low-level features with high-level features, we design four different feature fusion strategies, which are shown in Fig. 1. We add another block that generates low-level features at different scales using convolutions of different scales. For clarity, we call this added block Auxi-Block, as shown in Fig. 1. Auxi-Block has the same number of convolution layers as Block 1, which is employed to extract low-level features, but the scale of the convolution in each layer of Auxi-Block differs from that of Block 1. Hence, as shown in Fig. 1, we can fuse the features extracted from Auxi-Block with the low-level and high-level features extracted from Block 1, Block 2, Block 3 and Block 4. We call these four fusions L-Fusion 1, L-Fusion 2, H-Fusion 1 and H-Fusion 2. These four fusion strategies not only increase the width of the network but also use multiscale features, which intuitively enhances the performance of CNNs.
We define an input feature $\boldsymbol {X}\in R^{W\times H\times C}$, a set of convolution kernels $K_{1},\ldots ,K_{l}$ with biases $b_{1},\ldots ,b_{l}$, and a nonlinear activation $F[\cdot ]$. The outputs of the $l$ stacked convolutional layers in a block are then \begin{align*} X_{1}&=F\left [{Conv2d\left ({\boldsymbol {X}\ast K_{1}}\right)+b_{1}}\right]\in R^{W\times H\times C} \\ X_{2}&=F\left [{Conv2d\left ({X_{1}\ast K_{2}}\right)+b_{2}}\right]\in R^{W\times H\times C} \\ &\;\;\vdots \\ X_{l}&=F\left [{Conv2d\left ({X_{l-1}\ast K_{l}}\right)+b_{l}}\right]\in R^{W\times H\times C}\tag{1}\end{align*}
Let $O_{a}$ denote the output of Auxi-Block and $O_{i}\in R^{W_{i}\times H_{i}\times C_{i}}$ $(i=1,\ldots ,4)$ denote the outputs of Block 1 to Block 4. When $O_{a}$ is concatenated with these outputs, the four fused features are \begin{align*} O_{L-Fusion 1}&=cat\left ({O_{a},O_{1}}\right)\in R^{W_{1}\times H_{1}\times (C_{a}+C_{1})} \\ O_{L-Fusion 2}&=cat\left ({O_{a},O_{2}}\right)\in R^{W_{2}\times H_{2}\times (C_{a}+C_{2})} \\ O_{H-Fusion 1}&=cat\left ({O_{a},O_{3}}\right)\in R^{W_{3}\times H_{3}\times (C_{a}+C_{3})} \\ O_{H-Fusion 2}&=cat\left ({O_{a},O_{4}}\right)\in R^{W_{4}\times H_{4}\times (C_{a}+C_{4})}\tag{2}\end{align*} where $cat(\cdot ,\cdot)$ denotes channelwise concatenation; the spatial size of $O_{a}$ must match that of the block output with which it is fused.
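As a quick sanity check of the shapes in Equation (2), the snippet below concatenates a hypothetical Auxi-Block output with block outputs along the channel dimension. The spatial sizes and channel counts are illustrative only, and the average pooling used to match spatial sizes for H-Fusion is one possible choice rather than the paper's prescribed operation.

```python
import torch
import torch.nn.functional as F

# Illustrative feature maps (batch, channels, H, W).
O_a = torch.randn(1, 32, 32, 32)   # Auxi-Block output, C_a = 32
O_1 = torch.randn(1, 64, 32, 32)   # Block 1 output, same spatial size as O_a
O_3 = torch.randn(1, 256, 8, 8)    # Block 3 output, spatially downsampled

# L-Fusion 1: channelwise concatenation, spatial sizes already match.
O_lf1 = torch.cat([O_a, O_1], dim=1)
print(O_lf1.shape)  # torch.Size([1, 96, 32, 32]) -> C_a + C_1 channels

# H-Fusion 1: O_a must first be brought to Block 3's spatial size
# (here via average pooling; other resizing choices are possible).
O_a_small = F.adaptive_avg_pool2d(O_a, output_size=O_3.shape[-2:])
O_hf1 = torch.cat([O_a_small, O_3], dim=1)
print(O_hf1.shape)  # torch.Size([1, 288, 8, 8]) -> C_a + C_3 channels
```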
Take L-Fusion 1 as an example. We hypothesize that Auxi-Block and Block 1 both contain $L$ convolutional layers and that their final outputs are $X_{L}$ and $X_{L}^{\prime }$, respectively.
When the outputs of Auxi-Block and Block 1 are concatenated, the fusion feature can be computed by Equation (4). Concatenation means that one weight matrix is stacked on another weight matrix; therefore, the loss of backward propagation influences the weights associated with $X_{L}$ and $X_{L}^{\prime }$ separately. \begin{equation*} Fusion=F\left [{cat(X_{L},X_{L}^{\prime })\ast w+b }\right]\tag{4}\end{equation*}
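A minimal PyTorch sketch of Equation (4) follows, assuming that the fusion is realized as channelwise concatenation followed by a convolution and a nonlinear activation; the kernel size of the fusing convolution and the channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Equation (4): concatenate two feature maps, then apply a convolution and activation."""
    def __init__(self, ch_a, ch_b, out_ch, kernel_size=1):
        super().__init__()
        self.conv = nn.Conv2d(ch_a + ch_b, out_ch, kernel_size, padding=kernel_size // 2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x_l, x_l_prime):
        return self.act(self.conv(torch.cat([x_l, x_l_prime], dim=1)))

x_l = torch.randn(1, 64, 32, 32)        # Block 1 output
x_l_prime = torch.randn(1, 64, 32, 32)  # Auxi-Block output
print(ConcatFusion(64, 64, 64)(x_l, x_l_prime).shape)  # torch.Size([1, 64, 32, 32])
```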
B. Multiscale Features
The convolution operation is a crucial element of deep learning architectures since a number of filters slide across the input image [33]; hence, the filters are pivotal for the convolution operation. The larger the scale of a filter is, the larger its receptive field is. In some cases, increasing the scale of the convolution filter indeed improves the classification accuracy. For instance, many CNNs use a large filter in their early layers, such as the $7\times 7$ convolution in the first layer of Resnet [12].
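To make the receptive-field argument concrete, the short helper below computes the receptive field of a stack of convolutions (stride 1 and no dilation are assumed); it is an illustrative calculation, not code from the paper.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: r = 1 + sum(k_i - 1)."""
    r = 1
    for k in kernel_sizes:
        r += k - 1
    return r

# One 5x5 convolution and two stacked 3x3 convolutions see the same 5x5 region,
# while a 7x7 convolution sees a larger region than either.
print(receptive_field([5]))     # 5
print(receptive_field([3, 3]))  # 5
print(receptive_field([7]))     # 7
```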
There are few networks that consist entirely of larger-scale convolutional layers, such as $5\times 5$ or $7\times 7$ convolutions.
Because the stages or blocks before the first downsampling contain few convolutional layers and few channels per layer, we stack a series of convolutional layers with a large-scale filter to construct a block rather than utilizing a single large-scale convolutional layer. This block not only ensures that we extract features at different scales but also adds few parameters, which most distinguishes our method from others.
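As a rough illustration of why such a block stays cheap, the snippet below counts parameters for a hypothetical early-stage block of $5\times 5$ convolutions with few channels and compares it with a single late-stage $3\times 3$ convolution with many channels; the channel numbers are placeholders, not the paper's configurations.

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Early-stage Auxi-Block-like stack: three 5x5 convolutions with few channels.
early_block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2),
    nn.Conv2d(32, 32, kernel_size=5, padding=2),
    nn.Conv2d(32, 32, kernel_size=5, padding=2),
)

# A single late-stage 3x3 convolution with many channels.
late_conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)

print(count_params(early_block))  # 53,696 parameters
print(count_params(late_conv))    # 2,359,808 parameters
```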
Experimental Results
In the first stage, according to the aforementioned definitions in section 3, we design two networks, Net 1 and VGG-16-V, based on the architectures of [13] and [11], respectively, to address the two important problems: 1) whether the fusion of multiscale features achieves better performance and 2) which scale feature is more effective. The configurations of the two networks and the feature fusion strategies are shown in Fig. 2 and Fig. 3, respectively. There are four feature fusion strategies at every scale, as shown in Fig. 1. Hence, there are 20 feature fusion experiments in total for a network. In the second stage, we select five networks, DenseNet-BC (depth = 40), ALL-CNN-C (depth = 9), Darknet 19 (depth = 19), Resnet 18 (depth = 18) and Resnet 50 (depth = 50), to verify the conclusions generated in the first stage.
Feature fusion strategies for Net 1. We select Auxi-Blocks 1, 2, 3, 4 and 5 in turn to extract five different scale features. Then, we use every feature to implement four feature fusion strategies. We ignore the first convolution layer of Net 1 and Auxi-Block.
Architecture of VGG-16-V, which is a variant of VGG-16. We also design the same features with different scales and the same fusion strategies as Net 1. Note that the “Conv2d” layer shown in the figure corresponds to the sequence BN-ReLU-Conv2d.
A. Datasets
We evaluate the proposed fusion strategies in the first stage on two standard benchmark datasets [36]: CIFAR10 and CIFAR100. Both datasets contain 50K training images and 10K test images, with 10 and 100 categories, respectively. We use all 50K training images for training, without holding out a validation set, and the 10K test images for testing. We normalize the data using the channel means and standard deviations as in [46]. During training, we adopt a data augmentation scheme with random cropping, random horizontal flips and normalization [47], which has been widely used for these two datasets [11], [12], [19], to obtain two augmented datasets that we call CIFAR10+ and CIFAR100+, respectively. During testing, we only normalize the data using the channel means and standard deviations [47].
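A minimal torchvision sketch of this augmentation pipeline is given below. The 4-pixel padding before the 32x32 random crop and the exact normalization statistics are the commonly used choices for CIFAR and are assumptions rather than values stated in the paper.

```python
import torchvision.transforms as T

# Commonly used CIFAR channel means/stds (assumed; the paper follows [46], [47]).
CIFAR_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR_STD = (0.2470, 0.2435, 0.2616)

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),         # random cropping
    T.RandomHorizontalFlip(),            # random horizontal flips
    T.ToTensor(),
    T.Normalize(CIFAR_MEAN, CIFAR_STD),  # normalization
])

test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize(CIFAR_MEAN, CIFAR_STD),  # testing uses normalization only
])
```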
We adopt the ILSVRC 2012 classification dataset [45], which consists of 1.2 million training images and 50,000 validation images from 1,000 classes, to further validate the effectiveness of our proposed method. We use the training set during the training stage and report the classification errors on the validation set. During training, we randomly crop the training images to $224\times 224$.
B. Training
We implement the proposed networks in the PyTorch framework on two NVIDIA GeForce RTX 2080 GPUs. The weight initialization strategy follows [37]. For CIFAR10 and CIFAR100, all the networks are trained using stochastic gradient descent (SGD), the cross-entropy loss and the ReLU. The weight decay, momentum and initial learning rate are set to 0.0001, 0.9 and 0.1, respectively. All the models are trained for 300 epochs, the learning rate is divided by 10 at the 150th and 225th epochs, and the batch size is 256. Based on Darknet 19 [42], Resnet 18 [12] and Resnet 50 [12], we implement the fusion operation on the ILSVRC 2012 classification dataset and train the three networks and their correspondingly modified networks for only 90 epochs with SGD, the cross-entropy loss and the ReLU. For Darknet 19 [42], a leaky ReLU (negative slope of 0.1) replaces the ReLU. The initial learning rate is set to 0.1 and is divided by 10 at the 30th, 60th and 75th epochs. The weight decay and momentum are set to 0.0001 and 0.9, respectively. Due to the limitation of GPU memory, the batch size is set to 64.
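The CIFAR training schedule described above can be reproduced with a standard PyTorch optimizer and learning-rate scheduler, as sketched below; `model` and the data loader are placeholders rather than the paper's networks.

```python
import torch
import torch.nn as nn

# 'model' is a placeholder for any of the networks discussed above.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 10))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 at the 150th and 225th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 225], gamma=0.1)

for epoch in range(300):
    # for images, labels in train_loader:  # batch size 256 in the paper
    #     optimizer.zero_grad()
    #     loss = criterion(model(images), labels)
    #     loss.backward()
    #     optimizer.step()
    scheduler.step()
```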
C. Classification Results of Net 1 and VGG-16-V
We construct two CNNs, Net 1 and VGG-16-V, to implement the feature fusion strategies. The input image size is $32\times 32$ for CIFAR10 and CIFAR100.
Net 1, which is inspired by the architecture of [19], consists of 45 convolutional layers, and we build its variants Net 2 and Net 3. One purpose of this paper is to study the advantage of multiscale feature fusion, but we also seek to answer whether large-scale features or multiscale feature fusion increases performance more. Therefore, we construct Net 2 and Net 3 to observe the performance of large-scale features. Their configurations are shown in Table 1. Except for the different convolutions in the first block, the settings of Net 1, Net 2 and Net 3 are identical; the convolutions of their first blocks are listed in Table 1.
We construct another network based on VGG-16 [11], which is a famous network that has been widely utilized for other CNNs, such as in [34] and [35]. Because of its good classification performance, we chose VGG-16 to implement the proposed methods. We use only one fully connected layer rather than three fully connected layers, and we call this VGG-16 variant VGG-16-V, as shown in Fig. 3.
Based on the Net 1 and VGG-16-V architectures, we design four fusion strategies, and each strategy leverages features at five scales. The details of the two networks and their configurations are shown in Fig. 2 and Fig. 3, respectively. We select Auxi-Blocks 1 to 5 in turn to provide the five features at different scales.
Table 4 shows the classification errors of the four fusion strategies when the features extracted from different Auxi-Blocks are concatenated with the features generated by the different blocks of Net 1. Comparing the second column and the fourth column of Table 2, we find that all L-Fusion strategies improve the performance on both CIFAR10+ and CIFAR100+. On CIFAR10+, the best result shows that we reduce the error by 1.12% by adding the corresponding Auxi-Block.
For VGG-16-V, the results of the experiments are shown in Table 6. Similarly, we find that all L-Fusion strategies improve the performance on both CIFAR10+ and CIFAR100+, whereas most H-Fusion strategies lead to poor performance. In terms of the best result, on CIFAR10+, we reduce the error by 1.59% with the corresponding L-Fusion strategy.
D. Fusion Operation
From Table 4 and Table 6, we find that all L-Fusion strategies result in better performance. Specifically, multiscale low-level feature fusion can remarkably improve the performance of CNNs, and we consider these results to be statistically meaningful. Moreover, we do not find that the nonsquare convolutions, $5\times 3\cup 3\times 5$ and $7\times 3\cup 3\times 7$, are remarkably better than the square convolutions $3\times 3$, $5\times 5$ and $7\times 7$.
As discussed above, feature fusion is very important for enhancing the performance of CNNs. In all the aforementioned experiments, we apply channelwise concatenation to perform the feature fusion operation. Nevertheless, there is another principal fusion operation: elementwise summation. In this study, we harness the two operations to determine which is more suitable for the proposed method and implement both in VGG-16-V. Simultaneously, we emulate the multilayer perceptron (MLP) introduced by [19]. Following the two fusion operations, we also implement a nonlinear activation operation: a $1\times 1$ convolution followed by a ReLU, which we denote CR.
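The comparison can be sketched as follows in PyTorch, assuming that CR denotes a $1\times 1$ convolution followed by a ReLU (our reading of the text); the channel counts are placeholders.

```python
import torch
import torch.nn as nn

def cr(in_ch, out_ch):
    """CR operation as assumed here: a 1x1 convolution followed by a ReLU."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True))

x_base = torch.randn(1, 64, 32, 32)  # features from Block 1 of BaseNet
x_auxi = torch.randn(1, 64, 32, 32)  # features from Auxi-Block

# Concatenation + CR: channels are stacked, then mixed back to 64 channels.
concat_cr = cr(128, 64)(torch.cat([x_base, x_auxi], dim=1))

# Elementwise summation + CR: features are added, so channel counts must match.
sum_cr = cr(64, 64)(x_base + x_auxi)

print(concat_cr.shape, sum_cr.shape)  # both torch.Size([1, 64, 32, 32])
```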
As depicted in Table 7, both concatenation and summation improve performance when integrated with the CR operation, which illustrates that the CR operation is helpful for improving performance. Moreover, concatenation outperforms elementwise summation, especially on CIFAR100+. We consider that the features extracted by different blocks are discriminative and independent: elementwise summation mutually disturbs the features, while concatenation maintains their independence. Hence, we suggest using the Concatenation + CR fusion operations and implement them in the subsequent verification networks.
E. Classification Results of Five Verification Networks
To further support our hypotheses, we verify the fusion strategy on DenseNet-BC (depth = 40) [13] and ALL-CNN-C (depth = 9) [15] due to their different depths and high classification accuracy. We select the CIFAR dataset to evaluate the classification performance. The configurations of the two verification networks are shown in Table 8, and the results are shown in Table 9. Because the transition layer of Block 1 of DenseNet-BC has the same function as Concatenation + CR fusion operations, we delete it.
From Table 9, we find that the proposed approach significantly improves the performance of DenseNet-BC and ALL-CNN-C. On CIFAR10+, we improve the accuracy of DenseNet-BC and ALL-CNN-C by 0.76% and 1.15%, respectively. On CIFAR100+, we reduce the errors of DenseNet-BC and ALL-CNN-C by 2.25% and 4.68%, respectively. Although the number of parameters increases greatly, the additional computational cost is very low.
We select the ILSVRC 2012 classification dataset to observe the applicability of the proposed approach to a larger dataset. Darknet 19 [42], Resnet 18 [12] and Resnet 50 [12] perform well on the ILSVRC 2012 classification dataset and have different numbers of convolutional layers. Based on the L-Fusion 1 architecture, we append an Auxi-Block to the three networks to construct Darknet-19-Fusion, Resnet-18-Fusion and Resnet-50-Fusion; the specific structures are shown in Table 10. We only show the changes brought to these three networks by the proposed method; the remaining architectures remain unchanged. We append the two parallel blocks at the early stage of each network, as shown in Table 10.
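As an illustration of how such a modification might look in code, the sketch below wraps the torchvision Resnet 18 with a parallel Auxi-Block on its early features and fuses the two paths with Concatenation + CR before the remaining stages. This is our own simplified reading of L-Fusion 1; the kernel size, channel numbers and fusion point are assumptions rather than the exact structure in Table 10.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResNet18LFusionSketch(nn.Module):
    """Hypothetical Resnet-18-Fusion: a parallel 5x5 Auxi-Block beside the first
    residual stage, fused by concatenation and a 1x1 convolution (CR)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        base = resnet18()  # no pretrained weights
        self.stem = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool)
        self.block1 = base.layer1                      # original first stage (64 channels)
        self.auxi_block = nn.Sequential(               # assumed Auxi-Block: two 5x5 convs
            nn.Conv2d(64, 64, 5, padding=2), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 5, padding=2), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Sequential(nn.Conv2d(128, 64, 1), nn.ReLU(inplace=True))  # CR
        self.rest = nn.Sequential(base.layer2, base.layer3, base.layer4, base.avgpool)
        self.fc = base.fc if num_classes == 1000 else nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.stem(x)
        fused = self.fuse(torch.cat([self.block1(x), self.auxi_block(x)], dim=1))
        out = self.rest(fused)
        return self.fc(torch.flatten(out, 1))

logits = ResNet18LFusionSketch()(torch.randn(1, 3, 224, 224))  # -> (1, 1000)
```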
F. Applicability
The residual block [38], dense block [13] and inception module [11] are fundamental components used to construct high-performing architectures, especially the first. These three layouts fuse features on a layer basis: the residual block and the dense block integrate the features extracted from earlier layers with the features extracted from later layers, and the inception module exploits multiple filters in one layer to represent the features. In contrast, we fuse the features extracted from different blocks, and the fusion operation is only used in the preceding stages of a CNN. This is the main distinction between our method and the state-of-the-art methods.
One intention of this paper is to study whether fusing low-level features at different scales into different stages of a CNN improves its performance. The experimental results show that L-Fusion is helpful for enhancing the performance of CNNs. Moreover, by applying the L-Fusion 1 structure, which is shown in Fig. 1, to five verification networks, we further verify our conclusions. The Auxi-Block shown in Fig. 4 is easy to build and adds only a small overhead in model parameters and computation. Hence, our method can upgrade off-the-shelf networks by incorporating the proposed structure. We suggest building an Auxi-Block according to the structure of the first block of a CNN; however, we cannot ensure that our approach will work when the first block has too many convolutional layers.
Proposed architecture. Auxi-Block and Block 1 are two parallel blocks. In terms of an off-the-shelf network, we can improve its performance simply by adding an Auxi-Block that has the same architecture as Block 1, except for a different scale convolution. The fusion operation coincides with the human retina mechanism introduced in [28], which considered that there are two types of ganglion cells with respect to the receptive field.
G. Results Analysis
CNNs can learn a hierarchy of features [39]. CNNs represent the low-level features that are visually recognizable in the preceding stage and represent the high-level features that are semantically recognizable in subsequent stages. L-Fusion can lead CNNs to learn low-level features with different scales. Extracting multiscale low-level features is why L-Fusion can enhance performance. The results shown in Tables 4, 6, 9 and 10 confirm this conclusion. Additionally, the effect of dense connectivity [13] is another reason that L-Fusion improves performance.
High-level features are gradually learned from low-level features [30]. According to the high-level features, the network determines the object in an image. We consider that the semantically recognizable features will be disturbed if we fuse low-level features with high-level features for inference; namely, the fused high-level features contain strong and weak semantic features simultaneously. Therefore, H-Fusion results in poor performance, as shown in Table 4 and Table 6.
Conclusion
In this paper, we divide a CNN into different blocks according to the size of the features to obtain low-level and high-level features for feature fusion. We design two fusion strategies, L-Fusion and H-Fusion, to assess the influence of feature fusion at different stages. We select low-level features at five different scales to determine the advantage of multiscale feature fusion. L-Fusion, which fuses low-level features at different scales extracted from an auxiliary block with the low-level features extracted by a CNN, is observed to improve performance. The auxiliary block can be built according to the structure of the first block of a CNN. We validate this conclusion on five CNNs with high classification accuracy, and the experimental results show that our method achieves state-of-the-art performance. Simultaneously, the proposed architecture does not substantially increase the parameters of a CNN because the fusion operation takes place in the preceding stage.