Journals & Magazines >IEEE Access >Volume: 9

Improving the Performance of Convolutional Neural Networks by Fusing Low-Level Features With Different Scales in the Preceding Stage

A graphical abstract for Improving the Performance of Convolutional Neural Networks by Fusing Low-Level Features With Different Scales in the Preceding Stage.

Abstract:

The width of convolutional neural networks (CNNs) is crucial for improving performance. Many wide CNNs use a convolutional layer to fuse multiscale features or fuse the p...Show More

Metadata

Abstract:

The width of convolutional neural networks (CNNs) is crucial for improving performance. Many wide CNNs use a convolutional layer to fuse multiscale features or fuse the preceding features to subsequent features. However, these CNNs rarely use blocks, which consist of a series of successive convolutional layers, to fuse multiscale features. In this paper, we propose an approach for improving performance by fusing the low-level features extracted from different blocks. We utilize five different convolutions, including 3×3, 5×5, 7×7,5×3 ∪ 3×5 and 7×3 ∪ 3×7, to generate five low-level features, and we design two fusion strategies: low-level feature fusion (L-Fusion) and high-level feature fusion (H-Fusion). Experimental results show that the L-Fusion is more helpful for improving the performance of CNNs, and the 5×5 convolution is more suitable for multiscale feature fusion. We summarize the conclusion as a strategy that fuses multiscale features in the preceding stage of CNNs. Furthermore, we propose a new architecture to perceive the input of CNNs by using two self-governed blocks based on the strategy. Finally, we modify five off-the-shelf networks, DenseNet-BC (depth = 40), ALL-CNN-C (depth = 9), Darknet 19 (depth = 19), Resnet 18 (depth = 18) and Resnet 50 (depth = 50), by utilizing the proposed architecture to verify the conclusion, and these updated networks provide more competitive results.

A graphical abstract for Improving the Performance of Convolutional Neural Networks by Fusing Low-Level Features With Different Scales in the Preceding Stage.

Published in: IEEE Access ( Volume: 9)

Page(s): 70273 - 70285

Date of Publication: 03 May 2021

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2021.3077070

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

CNNs (convolutional neural networks) [1] were first presented in 1989, and they have demonstrated excellent performance in many visual tasks such as semantic segmentation [2], [3], image classification [4], and object detection [5], [6]. In particular, as hardware has developed, the performance of CNNs has increased dramatically due to the higher computational capacity of the hardware. However, most CNNs are still not as accurate as a human visual system. In recent years, many efforts have been made to improve the performance of CNNs by regularizing parameters [7], exploiting superior loss functions [8], modifying pooling operations [9], and designing more meaningful network architectures [10].

Some classic models have validated that the depth of a CNN is pivotal for its performance [11], [4]. Furthermore, many visual recognition tasks have benefitted from very deep networks [12], [13]. In many cases, a considerably deeper network indeed achieves better results than a shallower network, and we can easily obtain a higher-quality model by increasing the depth. However, a very deep CNN has many disadvantages. First, a deeper network will result in vanishing or exploding gradients [14], which can lead to a disconvergence of the results. In addition, a deeper network is associated with more parameters. In particular, in the late stages of networks, the number of parameters massively increases as the convolution channels increase, which increases the additional computational costs for these massive numbers of parameters. Moreover, more stacked layers hamper the performance of networks due to overfitting. Specifically, the model has a lower training error, but it has a higher testing error. Additionally, as the network depth increases, the accuracy becomes saturated and degrades rapidly [12].

To solve the problems caused by depth, [11] introduced an “inception module” to increase the width of a CNN and considered that visual information should be abstracted at various scales to obtain multiscale features. The “inception module” is computationally less expensive because it can utilize a set of small-scale convolutions to replace a larger-scale convolution in a layer. In a wide dense block [13], the features of all preceding layers are used as inputs into all subsequent layers. In some sense, increasing the width can be regarded as feature fusion since it fuses both the previous and posterior features of some blocks in stages. In a wide residual block [12], the inputs of a subsequent layer are extracted from not only its preceding layer but also from other layers preceding that layer by using shortcut connections. The two types of blocks increase their width and fuse the preceding features to their subsequent features, which dramatically enhances the performance of their respective networks. In object detection, some high-performance object detection architectures [40] harness high-resolution features with small scales, and some larger-scale features are extracted from a shallow layer. Due to leveraging the better localization properties of low-level features, these architectures markedly improve the location accuracy and recall rate.

Based on the merits of increasing the width of CNNs, we propose a novel method that adds an auxiliary block to the different blocks of a CNN to increase the width. The auxiliary block can extract different scale low-level features, which are concatenated with the different level features extracted from different blocks of the CNN. Our method can be summarized as feature fusion between different scale low-level features and different level features. Based on the feature fusion form, we design two fusion strategies—low-level feature fusion (L-Fusion) and high-level feature fusion (H-Fusion)—and five scale low-level features, which are implemented by two CNNs (Net 1 and VGG-16-V). We assess the performance of these networks on CIFAR10 and CIFAR100. Finally, we validate our conclusions on DenseNet-BC [13] (depth = 40), ALL-CNN-C [15] (depth = 9), Darknet 19 [42] (depth = 19), Resnet 18 [12] (depth = 18) and Resnet 50 [12] (depth = 50).

The contributions of this paper can be summarized as follows.

L-Fusion, which concatenates the low-level features extracted from Auxi-Block with the low-level features extracted from BaseNet, can efficiently enhance the performance of networks, regardless of whether the low-level features extracted from Auxi-Block are the same scale as the low-level features extracted from BaseNet.
In terms of feature fusion, the nonsquare convolutions $5\times 3\cup 3\times 5$ and $7\times 3\cup 3\times 7$ are not remarkably better than the square convolutions $3\times 3$ , $5\times 5$ and $7\times 7$ .
Owing to the fewer parameters of the $5\times 5$ convolution comparing the $7\times 7$ convolution or a larger size convolution, we select the $5\times 5$ convolution to extract different scale features for L-Fusion. Using L-Fusion, we modify the architectures of DenseNet-BC (40), ALL-CNN-C (9), Darknet (19), Resnet (18) and Resnet (50) to verify our conclusions. The experimental results show that the performances of the five modified networks are more competitive.
Only for our method does channelwise concatenation perform better than elementwise summation. Furthermore, the Concatenation + CR operations are more applicable.

The remainder of this paper is organized as follows: In section 2, we provide an overview of the related works, focusing on different modalities for improving the performance of CNNs. In section 3, we present the details of our proposed methods. In section 4, we provide the details of our experimental procedures, results and analysis. Section 5 concludes this paper.

SECTION II.

Related Work

Currently, CNNs have a typical structure: stacked convolutional layers are followed by one or more fully connected layers. More precisely, before going across a convolutional layer, the input parameters will implement two consecutive operations: batch normalization [7] followed by linear activation (ReLU) [16]. To reduce the dimensionality and decrease the overfitting of a large network, downsampling will be necessarily employed after a linear activation layer. Because of the typical structure in conjunction with the principle of backward propagation, there are three principal research directions to enhance the performance of deep CNNs:

1) optimizing the convolution operation and pooling operation, 2) modifying the activation function and loss function, and 3) increasing the depth and width.

A. Optimizing the Pooling Operation and Convolution Operation

In CNNs, convolution is the most basic operation, and a pooling layer always follows a certain convolutional layer. These layers are very important to extract semantic features and reduce the parameters of the model. Based on the sparsity of the activations in a pooling region, [9] proposed a sparsity-based stochastic pooling mechanism that integrated the advantages of max pooling, average pooling, and stochastic pooling for CNNs to improve the recognition accuracy. According to the observation that a ranking list was invariant when the activation operation changed the values of a pooling region, [17] compared three new pooling methods, including rank-based average pooling, rank-based weighted pooling and rank-based stochastic pooling, with the conventional pooling operation on four benchmark image datasets, and the results showed that rank-based pooling outperformed the existing pooling methods in classification performance. Due to the existence of redundancy in a standard convolution, which can be divided into spatial and channel domains, [18] achieved superior performance by relaxing the sparsity of the convolution in the spatial domains and reducing the redundancy in the channel domains. Replacing the conventional convolutional layer with a multilayer perceptron, [19] designed a novel deep network by stacking multiple perceptrons and a global average pooling layer. Reference [20] proposed replacing the dense shallow multilayer perceptron with a sparse shallow multilayer perceptron to decrease the number of parameters and extract local features in channel domains and in spatial domains. To make full use of the spatial features and temporal features for video action recognition, [44] designed trajectory pooling and line pooling to fuse spatial and temporal information.

B. Modifying the Activation Function and Loss Function

The activation function is crucial for a neural network since it determines whether a neuron works, and the loss between the output value of the network and the label value is very important for backpropagation. Many research efforts have been made in these two respects to improve network performance. Reference [21] improved the fine-grained image classification accuracy by applying a generalized large-margin loss to AlexNet, GoogLeNet and VGG. Reference [22] proposed a deep CNN with an associated objective function that consisted of a max-margin objective, a max-correlation objective and a correntropy loss for a minimum score of positive labels, a latent semantic space and a minimal training loss. To eliminate the requirement of carefully tuning the learning rate to prevent exploding gradients, [23] designed a multitask loss function to conduct joint training for classification and a bounding box regression. Reference [24] designed a fast exponentially linear unit with a rectified linear unit (ReLU) and an exponential linear unit (ELU). It used the ELU on the negative part and the ReLU on the positive part to accelerate the calculation speed and improve the robustness of the network. Reference [25] employed different activation functions, including the rectified linear unit (ReLU), leaky ReLU (LReLU), parametric ReLU (PReLU) and exponential linear unit (ELU), in different convolutional layers to extract better information.

C. Increasing the Depth and Width

In 2012, researchers began to study the importance of the depth (the number of layers) of CNNs. By using an ablation study and occlusion experiment, [26] demonstrated that the depth of networks, rather than any individual section, was vital to their performance. Reference [27] increased the depth by using a set of subnetworks to extract complementary features and the same number of classifiers based on random projections.

The width (the number of units of each layer) is as important as the depth for the performance of CNNs. Reference [10] introduced a novel network form by stacking a series of “inception modules”, which consisted of a group of different scale convolutions in one layer. This approach increased both the depth and the width of the network, which improved the classification performance. Subsequently, an increasing number of researchers have studied how the width of a network influences network performance, and the “inception module” has been continuously upgraded. By factorizing convolutions of the original “inception module” [10] and adding batch normalization to diminish the computational cost, [52] proposed two modified architectures: Inception v2 and Inception v3. Reference [53] constructed the Inception v4 architecture by simplifying the architecture of Inception v3 and adding more “inception modules”.

Reference [12] added a skip connection that bypassed the nonlinear transformation with an identity shortcut: $X_{l}=H\left ({X_{l-1}}\right)+X_{l-1}$ . Reference [13] introduced a dense connection that connected any layer to all subsequent layers: $X_{l}=H(\left [{X_{0},X_{1},\ldots,X_{l-1}}\right])$ , where $X_{l}$ denotes the features extracted from the $\ell ^{th}$ layer (or the input feature if the layer is input layer); $H(\cdot)$ represents the output of a composite function of some consecutive operations, such as batch normalization, ReLU, and convolution; and $\left [{X_{0}, X_{1},\ldots,X_{l-1} }\right]$ represents the concatenation of the features abstracted by layers $0, 1, \ldots, l-1$ . To improve the adaptation of the inception architectures to new tasks, [48] exploited a split–transform–merge strategy to design a new module. The module divided the channels of a residual unit [12] into a number of channel sets with the same topology and then summed the outputs of these sets. Reference [49] designed a “res2net module” that replaced the $3\times 3$ filters of a residual unit with smaller filter groups. The receptive fields of these filter groups were different; hence, the “res2net module” could obtain multiscale representations. Reference [50] proposed a squeeze-and-excitation (SE) block that selectively emphasized informative features and suppressed fewer useful features by using a feature recalibration mechanism. The mechanism obtained the channelwise weights of the input of an SE block by using global average pooling and a gating mechanism and then output features by using channelwise multiplication between the input and the weights. To dispose of the difficulties caused by increasing the depth and width of CNNs, [51] designed a family of “polyinceptions” to improve the performance from the perspective of exploring structural diversity. These “polyinceptions” replaced a residual unit [12] with a polynomial inception unit. Furthermore, [51] introduced two “inception-resnet units” by combining the inception architecture with residual connections.

Inspired by the human retina mechanism, [28] suggested a coupled convolutional layer that consisted of a set of mutually constrained convolutions. Reference [29] used a fully convolutional two-stream fusion network to extract deep features from input images and user interactions individually, which achieved better image segmentation performance. Instead of using large-scale convolutions to construct a deep CNN directly, [30] proposed a series of cascaded subpatch convolutions that included a small-scale convolution and a $1\times 1$ convolution, and the new architecture was more robust for classification due to the merits of the cascaded subpatch convolutions. Reference [31] proposed a wide multiscale contrast network that was composed of three networks with identical structures and three subnetworks to extract multiscale and multilevel features for salient object detection. Reference [32] used two $1\times 1$ convolutions with different channels to generate two different channel-domain features, and then the two features were propagated to multiply spatial selective modules to obtain different spatial-domain information. Based on the relevance of multiple modalities, [41] extracted multiple features at several different convolutional layers from different modality-specific CNNs by exploiting modality-dedicated embedding layers, and then weighted feature fusion was employed for biometric identification. Reference [43] constructed a deep multitask learning framework by fusing certain intermediate layers of an off-the-shelf CNN to optimize a multitask learning algorithm, and the fused network performed better than an individual learning network without feature fusion. To maintain a spatially more accurate heatmap, [54] proposed a new architecture called a high-resolution network (HRNet) by combining a set of high-to-low subnetworks with different resolutions in parallel. Between these subnetworks, different resolution information could be fused repeatedly; inside each subnetwork, every convolutional layer had the same resolution.

SECTION III.

Proposed Method

To decrease the number of network parameters, the current prevailing design of CNNs implements a downsampling operation (such as max pooling and average pooling) after some convolutional layers to obtain downsampled features. We divide a CNN into different blocks on the basis of downsampling. If we use downsampling at an early or a late stage of the network, we will obtain a shallow feature (low-level) or deep feature (high-level), respectively. As shown in Fig. 1, BaseNet consists of 4 blocks. We treat the features in a block, which is a series of stacked consecutive convolutional layers, without downsampling as the same size; we regard the features as the same scale if they are extracted from the same scale convolutions. The features extracted by Block 1 or Block 2 are the shallow features (low-level), and the features extracted by Block 3 or Block 4 are the deep features (high-level).

FIGURE 1.

We build BaseNet and AuxiNet for multiscale feature fusion. BaseNet consists of 4 blocks, and each block has many stacked convolution layers without downsampling. Namely, the features generated by every convolution layer in a block have the same size. AuxiNet consists of only one block, which is exploited to extract the low-level features at different scales. We define two feature fusion strategies, L-Fusion and H-Fusion, which fuse the low-level features extracted from AuxiNet to the low-level features and the high-level features extracted from BaseNet, respectively.

Show All

A. Different Feature Fusion Strategies

To some extent, increasing the width of the network can be regarded as feature fusion because the features extracted by many different convolutions in one layer or by multiple parallel subnetworks will finally be fused. The level features are important for classification tasks. In every block of a network, the features produced by preceding convolutional layers can be fused to the features extracted from the subsequent layers to improve the robustness of the network, which has been verified by [12] and [13]. To study the influence of fusing low-level features with high-level features, we design four different feature fusion strategies, which are shown in Fig. 1. We add another block that can generate different scale low-level features using different scale convolutions. For the sake of clarity, we call the adding block Auxi-Block, which is shown in Fig. 1. Auxi-Block has the same number of convolution layers as Block 1, which is employed to extract low-level features, but the scale of the convolution in each layer of Auxi-Block is different from Block 1. Hence, as shown in Fig. 1, we can fuse the features extracted from Auxi-Block to the low-level features and high-level features extracted from Block 1, Block 2, Block 3 and Block 4. We call the four aforementioned fusions L-Fusion 1, L-Fusion 2, H-Fusion 1, and H-Fusion 2. These four fusion strategies not only increase the width of the network but also use multiscale features, which intuitively enhance the performance of CNNs.

We define an input feature ${ \boldsymbol X}\in R^{W\times H\times D}$ , where $W$ represents the width, $H$ represents the height and D represents the channels of the feature. We define the filter of the first convolution in a block as $K_{1}\in R^{(K\times K\times D)\times C}$ , and we define the remaining filters of the block as $K_{l}\in R^{(K\times K\times C)\times C}$ , where $l\in ~2, 3,\ldots,n$ represent the layers, $C$ represent the channels and $K$ represents the kernel size. When ${ \boldsymbol X}$ passes through a block (a sequence of convolution layers), the output of every layer can be computed by \begin{align*} \left |{\begin{matrix} X_{1}=F[Conv2d\left ({{ \boldsymbol X}\ast K_{1} }\right)+b_{1}]&\in R^{W\times H\times C} \\ X_{2}=F[Conv2d\left ({X_{1}\ast K_{2}}\right)+b_{2}]& \in R^{W\times H\times C} \\ \qquad \ldots \ldots & \\ X_{l}=F[Conv2d\left ({X_{l-1}\ast K_{l}}\right)+b_{l}]& \in R^{W\times H\times C} \\ \end{matrix}}\right |\tag{1}\end{align*} View Source

When $X_{l}$ passes through a downsampling layer with $stride=S$ , the output is ${X}_{l}^{\prime }\in R^{C\times (W\times H)/S}$ . In Fig. 1, the output of Block $i$ is $O_{i}\in R^{W_{i}\times H}_{i}\times C_{i}$ , where $i \in 1, 2, 3, 4$ . In addition, the output of Auxi-Block is $O_{a}\in R^{W_{a}\times H_{a}\times C_{a}}$ . L-Fusion and H-Fusion refer to the concatenation of the features produced by different blocks. We can implement L-Fusion 1 directly as the same size features, but we must construct a translation layer including a $1\times 1$ convolution and $S\times S$ pooling to translate the size and channels of the low-level features for concatenation. Therefore, the output of L-fusion and H-fusion can be expressed as Equation (2), where $cat$ in the four formulas implies concatenation.\begin{align*} \left |{\begin{matrix} O_{L-Fusion 1}=cat\left ({O_{a},O_{1}}\right)&\in R^{W_{1}\times H_{1}\times (C_{a}+C_{1})} \\ O_{L-Fusion 2}=cat\left ({O_{a},O_{2}}\right)&\in R^{W_{2}\times H_{2}\times (C_{a}+C_{2})} \\ O_{H-Fusion 1}=cat\left ({O_{a},O_{3}}\right)&\in R^{W_{3}\times H_{3}\times (C_{a}+C_{3})} \\ O_{H-Fusion 2}=cat\left ({O_{a},O_{4}}\right)&\in R^{W_{4}\times H_{4}\times (C_{a}+C_{4})} \\ \end{matrix}}\right |\tag{2}\end{align*} View Source

Take L-Fusion 1 as an example. We hypothesize that Auxi-Block and Block 1 have $L>1$ layers, from the 1^st layer to the $L$ th layer. Forward propagation can be computed in (3), as shown at the bottom of the page.

Show All

When the outputs of Auxi-Block and Block 1 are concatenated, the fusion feature can be computed by Equation (4). Concatenation means that one weight matrix is stacked on another weight matrix; therefore, the loss of backward propagation influences ${X}_{L}$ and $X_{L}^{\prime }$ independently. Namely, the loss is propagated back to the input in two pipelines, and the input is perceived by two self-governed blocks rather than a single block.\begin{equation*} Fusion=F\left [{cat(X_{L},X_{L}^{\prime })\ast w+b }\right]\tag{4}\end{equation*} View Source

B. Multiscale Features

The convolution operation is a crucial element of deep learning structures since a number of filters slide across the input image [33]; hence, the filters are pivotal for the convolution operation. The larger a scale filter is, the larger the receptive field is. In some cases, increasing the scale of the convolution filter indeed improves the classification accuracy. For instance, many CNNs use a large filter, such as $11\times 11$ [4] and $7\times 7$ [5], in the first convolutional layer to increase the receptive field. However, the large scale means more parameters. For example, a single $7\times 7$ convolution requires 7²C² = 49C² parameters, whereas a single $3\times 3$ convolution requires only 3²C² = 9C² parameters, where C is the number of channels of the convolution. Additionally, [11] considered that a stack of two $3\times 3$ convolutional layers (without spatial pooling in between) has an effective receptive field of $5\times 5$ , and three $3\times 3$ convolutional layers have an effective receptive field of $7\times 7$ . Hence, the $3\times 3$ convolution is better than the $7\times 7$ convolution or a larger scale convolution, and it is the most common convolution in CNNs.

There are few networks that consist of a set of larger scale convolutional layers entirely, such as $7\times 7$ and $9\times 9$ . However, visual information with various scales helps improve the performance of CNNs. Therefore, we cannot discard large-scale convolutions. We select 5 convolution scales, including $3\times 3, 5\times 5, 7\times 7, 5\times 3\cup 3\times 5$ and $7\times 3\cup 3\times 7$ , to extract the different scale features. We use the nonsquare convolutions $5\times 3 \cup 3\times 5$ and $7\times 3\cup 3\times 7$ , where $\cup $ means implementing the $3\times 5$ convolution and $5\times 3$ convolution in parallel, based on the observation that the objects in an image can be rectangular.

Due to the small number of convolutional layers and the few channels in each convolutional layer in the preceding stages or blocks before the first downsampling, we stack a series of convolutional layers with a large-scale filter to construct a block rather than utilizing a single large-scale convolutional layer. The block not only ensures that we extract different scale features but also adds fewer parameters, which makes it the most different from other methods.

SECTION IV.

Experimental Results

In the first stage, according to the aforementioned definitions in section 3, we design two networks, Net 1 and VGG-16-V, based on the architectures of [13] and [11], respectively, to address the two important problems: 1) whether the fusion of multiscale features achieves better performance and 2) which scale feature is more effective. The configurations of the two networks and the feature fusion strategies are shown in Fig. 2 and Fig. 3, respectively. There are four feature fusion strategies at every scale, as shown in Fig. 1. Hence, there are 20 feature fusion experiments in total for a network. In the second stage, we select five networks, DenseNet-BC (depth = 40), ALL-CNN-C (depth = 9), Darknet 19 (depth = 19), Resnet 18 (depth = 18) and Resnet 50 (depth = 50), to verify the conclusions generated in the first stage.

FIGURE 2.

Feature fusion strategies for Net 1. We select Auxi-Blocks 1, 2, 3, 4 and 5 in turn to extract five different scale features. Then, we use every feature to implement four feature fusion strategies. We ignore the first convolution layer of Net 1 and Auxi-Block.

Show All

FIGURE 3.

Architecture of VGG-16-V, which is a variant of VGG-16. We also design the same features with different scales and the same fusion strategies as Net 1. Note that the “Conv2d” layer shown in the figure corresponds to the sequence BN-ReLU-Conv2d.

Show All

A. Datasets

We evaluate the proposed fusion strategies in the first stage on two standard benchmark datasets [36]: CIFAR10 and CIFAR100. These datasets both contain 50K training images and 10K test images, but they consist of 10 categories and 100 categories, respectively. We use all the 50K training images for training without validation during the training stage, and the 10K test images are used for testing during the testing stage. We normalize the data using the channel means and standard deviations as in [46]. During training, we adopt a data augmentation scheme with random cropping, random horizontal flips and normalization [47], which has been widely used for the two datasets [11], [12], [19] to obtain two augmented datasets that we call CIFAR10+ and CIFAR100+, respectively. During testing, we only normalize the data by using the channel means and standard deviations [47].

We adopt the ILSVRC 2012 classification dataset [45], which consists of 1.2 million images for training and 50,000 for validation from 1,000 classes, to further validate the effectiveness of our proposed method. We use the training set during the training stage and report the classification errors on the validation set. During training, we randomly crop the size of the training images to $224\times 224$ and exploit random horizontal flips and normalization (mean = [0.485,0.456,0.406], std = [0.229,0.224,0.225]). We do not leverage scale augmentation and standard color augmentation. During testing, we adopt standard single-crop testing and apply a center crop to resize the validation images to $224\times 224$ only.

B. Training

We implement these proposed networks on the PyTorch framework and two NVIDIA GeForce RTX 2080 GPUs. The weight initialization strategy is also introduced in [37]. For CIFAR10 and CIFAR100, all the networks are trained using stochastic gradient descent (SGD), cross-entropy loss and the ReLU. The weight decay, momentum and initial learning rate are set to 0.0001, 0.9 and 0.1, respectively. All the models are trained for 300 epochs, the learning rate is divided by 10 at the 150th and 225th epochs, and the batch size is 256. Based on Darknet 19 [42], Resnet 18 [12] and Resnet 50 [12], we implement a fusion operation on the ILSVRC 2012 classification dataset and train the three networks and their correspondingly modified networks only for 90 epochs with stochastic gradient descent (SGD), cross-entropy loss and ReLU. For Darknet 19 [42], leaky ReLU (negative slope equals 0.1) replaces ReLU. The initial learning rate is set to 0.1 and is divided by 10 at the 30th, 60th and 75th epochs. The weight decay and momentum are set to 0.0001 and 0.9, respectively. Due to the limitation of the GPU memory, the batch size is set to 64.

C. Classification Results of Net 1 and VGG-16-V

We construct two CNNs, Net 1 and VGG-16-V, to implement the feature fusion strategies. The input image size is $32\times 32$ and is RGB, which is the same as the images of CIFAR10 and CIFAR100.

Net 1, which is inspired by the architecture of [19], consists of 45 convolutional layers, and we build its variants Net 2 and Net 3. One of our purposes of this paper is to study the advantage of multiscale feature fusion, but we also seek to answer whether large-scale feature or multiscale feature fusion increases performance more. Therefore, we construct Net 2 and Net 3 to observe the performance of the large features. Their configurations are shown in Table 1. Except for the different convolutions in the first block, the other settings are the same in Net 1, Net 2 and Net 3. The convolutions of the first block are $3\times 3, 5\times 5$ and $7\times 7$ , respectively. The results are shown in Table 2.

TABLE 1 Configurations of Net 1, Net 2 and Net 3. Note That the “Conv2d” Layer Shown in the Table Corresponds to the Sequence BN-ReLU-Conv2d Except for the First Convolutional Layer. We Build Net 2 and Net 3 to Determine Whether the Large Scale of the Feature Itself or the Fusion Strategy of Multiscale Features Attains Better Performance. We Only Modify the First Convolutional Layer and the Structure of Block 1 to Construct Net 2 and Net 3. With the Application of a Large-Scale Convolution, the Receptive Fields of Net 2 and Net 3 are Larger Than That of Net 1

TABLE 2 Classification Error (%) and Parameters (M) for CIFAR10+ and CIFAR100+ Using Net 1, Net 2 and Net 3. Because of Employing a Larger Scale Convolution, Net 2 and Net 3 Have More Parameters Than Net 1, But Net 2 and Net 3 Have Lower Classification Accuracy. Hence, for the CIFAR Dataset, a Larger Scale Convolution is Not Necessary for Improving the Performance of CNNs

TABLE 3 Parameters (M) on CIFAR10+ and CIFAR100+. Based on Net 1, We Compare the Four Fusion Strategies With the Five Features Extracted From Different Auxi-Blocks. Fusing Auxi-Block Into Net 1 Does Not Dramatically Increase the Number of Parameters, Even if Auxi-Block Uses a Larger Scale Convolution: $7 \times7$ Convolution

$Table 3- Parameters (M) on CIFAR10+ and CIFAR100+. Based on Net 1, We Compare the Four Fusion Strategies With the Five Features Extracted From Different Auxi-Blocks. Fusing Auxi-Block Into Net 1 Does Not Dramatically Increase the Number of Parameters, Even if Auxi-Block Uses a Larger Scale Convolution: $7 \times7$ Convolution$

We construct another network based on VGG-16 [11], which is a famous network that has been widely utilized for other CNNs, such as in [34] and [35]. Because of its good classification performance, we chose VGG-16 to implement the proposed methods. We use only one fully connected layer rather than three fully connected layers, and we call this VGG-16 variant VGG-16-V, as shown in Fig. 3.

Based on the Net 1 and VGG-16-V architectures, we design four fusion strategies, and each strategy leverages five features at five scales. The details of the two networks and configurations are shown in Fig. 2 and Fig. 3, respectively. We select the $3\times 3$ convolution; two larger size convolutions, $5\times 5$ and $7\times 7$ ; and two unconventional convolutions, $5\times 3\cup 3\times 5$ and $7\times 3\cup 3\times 7$ . We do not use much larger convolutions, such as $9\times 9$ and $11\times 11$ , due to the considerable increases in the numbers of parameters.

Table 4 shows the classification errors of the four fusion strategies when the features extracted from different Auxi-Blocks are concatenated with the features generated from the different blocks of Net 1. Comparing the second column and the fourth column of Table 2, we find that all L-Fusion strategies improve the performance for both CIFRA10+ and CIFRA100+. On CIFAR10+, the best result shows that we reduce the error by 1.12% by adding the $5\times 5$ convolution. On CIFRA100+, we improve the accuracy by 2.79% compared to the best result by using the $5\times 5$ convolution. However, in terms of most H-Fusion strategies, multiscale feature fusion has no effect. Furthermore, the H-Fusion strategies lead to much worse results.

TABLE 4 Classification Error (%) on CIFAR10+ and CIFAR100+. The Results That are Better Than Those of Net 1 are in Blue and Bolded. All L-Fusions Improve the Performance of Net 1. However, Most H-Fusions Decrease the Classification Accuracy of Net 1

TABLE 5 Parameters (M) on CIFAR10+ and CIFAR100+. Using VGG-16-V, We Compare the Four Fusion Strategies With Five Types of Features. We Find That H-Fusion Has More Parameters Than L-Fusion, and the Increase in Parameters is More Obvious as the Number of Channels Increases

For VGG-16-V, the results of the experiments are shown in Table 6. Similarly, we find that all L-Fusion strategies can improve the performance for both CIFRA10+ and CIFRA100+. Furthermore, most H-Fusion strategies lead to poor performance. In terms of the best result, on CIFRA10+, we reduce the error by 1.59% by using the $7\times 3\cup 3\times 7$ convolution. In addition, on CIFRA100+, we improve the accuracy by 3.3% compared to the best result by adding the $7\times 3\cup 3\times 7$ convolution.

TABLE 6 Classification Error (%) on CIFAR10+ and CIFAR100+. The Results That are Better Than Those of VGG-16-V are in Blue and Bolded. The Results Show the Same Pattern as Table 4. All L-Fusions are Beneficial for Performance, But Most H-Fusions Hinder Performance

D. Fusion Operation

From Table 4 and Table 6, we find that all L-Fusion strategies result in better performance. Specifically, multiscale low-level feature fusion can remarkably improve the performance of CNNs. We consider these results to have very important statistical meaning. Moreover, we do not find that the nonsquare convolutions, $5\times 3\cup 3\times 5$ and $7\times 3\cup 3\times 7$ , are notably better than the square convolutions, $3\times 3$ , $5\times 5$ and $7\times 7$ . To take advantage of multiscale feature fusion and to keep as few parameters as possible, we select only the $5\times 5$ convolution to extract different low-level features for feature fusion.

As we discussed above, feature fusion is very important for enhancing the performance of CNNs. In all the aforementioned experiments, we apply channelwise concatenation to perform the feature fusion operation. Nevertheless, there is another principal fusion operation: channelwise summation. In this study, we harness the two operations to research which style is more suitable for the method we proposed and implement the two operations in VGG-16-V. Simultaneously, we emulated a multilayer perceptron (MLP) introduced by [19]. Following two fusion operations, we also implement a nonlinear activation operation: a $1\times 1$ convolutional layer followed by a ReLU function. The results of the experiments are shown in Table 7.

TABLE 7 CR Denotes a Nonlinear Activation Operation, Conv2d ( $1\times1$ )-ReLU; Sum Represents an Elementwise Summation and Cat Means a Channelwise Concatenation. All the Results are Implemented in the VGG-16-V and L-Fusion 1 Architectures Discussed in Subsection C

$Table 7- CR Denotes a Nonlinear Activation Operation, Conv2d ( $1\times1$ )-ReLU; Sum Represents an Elementwise Summation and Cat Means a Channelwise Concatenation. All the Results are Implemented in the VGG-16-V and L-Fusion 1 Architectures Discussed in Subsection C$

As depicted in Table 7, by integrating the CR operation, concatenation and summation were observed to improve performance. This result illustrated that the CR operation is helpful for improving performance. Moreover, the concatenation outperforms the elementwise summation, especially on CIFAR100+. We considered that the features extracted by different blocks are discriminative and independent. Elementwise summation will mutually disturb features, while concatenation can maintain the independence between features. Hence, we suggest using Concatenation + CR fusion operations and implementing these operations on the subsequent verification networks.

E. Classification Results of Five Verification Networks

To further support our hypotheses, we verify the fusion strategy on DenseNet-BC (depth = 40) [13] and ALL-CNN-C (depth = 9) [15] due to their different depths and high classification accuracy. We select the CIFAR dataset to evaluate the classification performance. The configurations of the two verification networks are shown in Table 8, and the results are shown in Table 9. Because the transition layer of Block 1 of DenseNet-BC has the same function as Concatenation + CR fusion operations, we delete it.

TABLE 8 We Modify the Architectures of DenseNet-BC (depth = 40) and ALL-CNN-C (depth = 9) by Adding the Same Number of Convolutional Layers (the Bold Parts) in the First Stage of the Two Networks. We Select the $5\times5$ Convolution to Obtain Large-Scale Features and Then Implement the L-Fusion Strategy. We Concatenate the Features Generated by Block 1 and Auxi-Block and Then Implement Concatenation + CR Fusion Operations

$Table 8- We Modify the Architectures of DenseNet-BC (depth = 40) and ALL-CNN-C (depth = 9) by Adding the Same Number of Convolutional Layers (the Bold Parts) in the First Stage of the Two Networks. We Select the $5\times5$ Convolution to Obtain Large-Scale Features and Then Implement the L-Fusion Strategy. We Concatenate the Features Generated by Block 1 and Auxi-Block and Then Implement Concatenation + CR Fusion Operations$

TABLE 9 Error Rates (%) and Parameters (M) on the CIFAR Datasets. The Results Show That Fusing Different Scale Low-Level Features in the Preceding Stage of the Two Networks Can Dramatically Improve the Performance. The Classification Accuracy of the Modified DenseNet-BC (40) Outperforms the Classification Accuracy of ResNet-110 But With Fewer Convolutional Layers and Parameters

From Table 9, we find that the approach we proposed significantly improves the performance of DenseNet-BC and ALL-CNN-C. On CIFAR10+, we improve the accuracy of DenseNet-BC and ALL-CNN-C by 0.76% and 1.15%, respectively. On CIFRA100+, we reduce the error of DenseNet-BC and ALL-CNN-C by 2.25% and 4.68%, respectively. Although the number of parameters has increased greatly, the computational cost is very low.

We select the ILSVRC 2012 classification dataset to observe the applicability of the proposed approach to a larger dataset. Darknet 19 [42], Resnet 18 [12] and Resnet 50 [12] have good performance on the ILSVRC 2012 classification dataset and have different numbers of convolutional layers. Based on the L-Fusion 1 architecture, we append an Auxi-Block to the three networks to construct Darknet-19-Fusion, Resnet-18-Fusion and Resnet-50-Fusion, and the specific structures are shown in Table 10. We only show the changes brought to these three networks by the proposed method, and the remaining architectures of these three networks remain unchanged. We append two parallel $5\times 5$ convolutional layers to form Darknet-19-Fusion, and we add one parallel block that replaces the $7\times 7$ convolutional layer and all $3\times 3$ convolutional layers with a $5\times 5$ convolutional layer to form Resnet-18-Fusion and Resnet-50-Fusion. The classification results are shown in Table 10. We find that the performance of Darknet 19 has been slightly improved by only adding two parallel convolutional layers and concatenation + CR fusion operations. Similarly, by appending a parallel block in the preceding stage, Resnet-18-Fusion and Resnet-50-Fusion have significantly better results than Resnet 18 and Resnet 50. For Resnet-18-Fusion, we improve the top-1 accuracy and top-5 accuracy by 1.59% and 0.96%, respectively. In terms of Resnet-50-Fusion, we reduce the top-1 error and top-5 error by 0.4% and 0.6%, respectively.

TABLE 10 Parameters (M) and Error (%, Single-Crop Testing) on ImageNet Validation. During the Training Stage, the Input Size of These Six Networks is $224\times224$ . During the Testing Stage, We Only Adopt Single-Crop Testing and Obtain the Classification Errors at a Single Size: $224\times224$ . The Bold Black Parts Show the Structures Added for Feature Fusion. Because of the Simple Parallel Structure Added, Darknet-19-Fusion has Better Results Than Darknet 19. Similarly, by Appending a Parallel Block in the Preceding Stage, Resnet-18-Fusion and Resnet-50-Fusion Have Significantly Better Results Than Resnet 18 and Resnet 50

$Table 10- Parameters (M) and Error (%, Single-Crop Testing) on ImageNet Validation. During the Training Stage, the Input Size of These Six Networks is $224\times224$ . During the Testing Stage, We Only Adopt Single-Crop Testing and Obtain the Classification Errors at a Single Size: $224\times224$ . The Bold Black Parts Show the Structures Added for Feature Fusion. Because of the Simple Parallel Structure Added, Darknet-19-Fusion has Better Results Than Darknet 19. Similarly, by Appending a Parallel Block in the Preceding Stage, Resnet-18-Fusion and Resnet-50-Fusion Have Significantly Better Results Than Resnet 18 and Resnet 50$

F. Applicability

The residual block [38], dense block [13] and inception module [11] are fundamental components used to construct high-performing architectures, especially the first design. These three layouts fuse features on a layer basis. The residual block and dense block integrate the features extracted from a certain layer in front with the features extracted from the layers behind, and the inception module exploits multiple filters in a layer to represent the features. Nevertheless, we consider fusing the features extracted from different blocks, and the fusing operation is only used in the preceding stages of CNNs. This approach provides the most discrimination between the state-of-the-art method and our method.

In this paper, one of the intentions is to study whether the fusion of low-level features at different scales into different stages of a CNN improves its performance. The experimental results show that L-Fusion is helpful for enhancing the performance of CNNs. Moreover, by applying the L-Fusion 1 structure, which is shown in Fig. 1, to five verification networks, we further verify our conclusions. The Auxi-Block shown in Fig. 4 is easy to build and only adds a small overhead over the model parameters and computation. Hence, our method can update some off-the-shelf networks by elaborately designing the structure we proposed. We suggest building an Auxi-Block according to the structure of the first block of a CNN; however, we cannot ensure that our approach will work when the first block has too many convolutional layers.

FIGURE 4.

Proposed architecture. Auxi-Block and Block 1 are two parallel blocks. In terms of an off-the-shelf network, we can improve its performance simply by adding an Auxi-Block that has the same architecture as Block 1, except for a different scale convolution. The fusion operation coincides with the human retina mechanism introduced in [28], which considered that there are two types of ganglion cells with respect to the receptive field.

Show All

G. Results Analysis

CNNs can learn a hierarchy of features [39]. CNNs represent the low-level features that are visually recognizable in the preceding stage and represent the high-level features that are semantically recognizable in subsequent stages. L-Fusion can lead CNNs to learn low-level features with different scales. Extracting multiscale low-level features is why L-Fusion can enhance performance. The results shown in Tables 4, 6, 9 and 10 confirm this conclusion. Additionally, the effect of dense connectivity [13] is another reason that L-Fusion improves performance.

High-level features are gradually learned from low-level features [30]. According to the high-level features, the network can determine the object in an image. We consider that the semantically recognizable features will be turbulent if we fuse the low-level features with the high-level features for inference. Namely, the fused high-level features include strong and weak semantic features simultaneously. Therefore, H-Fusion will result in poor performance, and its results are shown in Table 4 and Table 6.

SECTION V.

Conclusion

In this paper, we divide a CNN into different blocks according to the size of the features to obtain low-level and high-level features for feature fusion. We design two fusion strategies, L-Fusion and H-Fusion, to assess the influence of feature fusion at different stages. We select five low-level features with different scales to determine the advantage of multiscale feature fusion. L-Fusion, which fuses a low-level feature with different scales extracted from an auxiliary block to the low-level features extracted by a CNN, is observed to improve performance. The auxiliary block can be built according to the structure of the first block of a CNN. We validate the conclusion on five CNNs with high classification accuracy, and the experimental results show that our method achieves state-of-the-art performance. Simultaneously, the proposed architecture will not substantially increase the parameter of a CNN because the fusion operation takes place in the preceding stage.

References is not available for this document.

Improving the Performance of Convolutional Neural Networks by Fusing Low-Level Features With Different Scales in the Preceding Stage

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction

Related Work

A. Optimizing the Pooling Operation and Convolution Operation