1. Introduction
Following the success of deep learning in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [20], the best performance in classification competitions has almost invariably been achieved with convolutional neural network (CNN) architectures. AlexNet [16] is composed of convolutions with three receptive field sizes (11×11, 5×5, and 3×3). VGG [21] is based on the idea that a stack of two 3×3 convolutional layers, which has an effective receptive field of 5×5, is more effective than a single 5×5 convolution. GoogLeNet [24]–[26] introduced the Inception layer, which composes convolutions with various receptive fields. The residual network [10], [11], [29], which adds shortcut connections to implement identity mapping, allows more layers to be stacked without running into the vanishing gradient problem. Recent research on CNNs has mostly focused on the composition of layers rather than on the convolution operation itself.
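To make the shortcut-connection idea concrete, the sketch below shows a minimal residual block in which the unmodified input is added to the output of two stacked 3×3 convolutions. This is an illustrative example assuming a PyTorch environment; the channel count and layer choices are placeholders, not details taken from the architectures cited above.

```python
# Minimal sketch of a residual block with an identity shortcut (hypothetical
# example; channel sizes and normalization choices are illustrative only).
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions whose output is added to the input."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Shortcut connection: add the unmodified input (identity mapping)
        # before the final nonlinearity.
        return self.relu(out + x)


if __name__ == "__main__":
    block = ResidualBlock(channels=16)
    y = block(torch.randn(1, 16, 32, 32))
    print(y.shape)  # torch.Size([1, 16, 32, 32])
```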