I. Introduction
Classifying images correctly based on their visual content has been widely explored in recent years. Local feature-based methods are commonly used along with bag-of-visual-words (BoW) strategies [1]–[4]. Local features [5], [6] are carefully designed by experts to cope with various visual changes and noisy information. Although effective, local features are very hard to design, as doing so requires rich domain knowledge and experience. Deep convolutional neural network (CNN)-based methods [7]–[12] instead try to learn representations directly from image pixels. This strategy has proven very effective for various visual applications [13]–[18]. Performance can be further improved with deeper networks and architectures.