1. INTRODUCTION
Image classification has long received substantial attention in computer vision research, and deep learning methods have recently achieved excellent performance on it. A key challenge in this area is how to learn image representations effectively. In general image classification, the category of an image is closely tied to the object it contains, as shown in Figure 1. The category of a scene image, by contrast, depends on several typical objects and their spatial layout. The diversity of spatial layouts and object co-occurrences across scenes can lead to intra-class differences and inter-class similarities, which reduces the accuracy of scene recognition. Deep learning methods designed for general image classification are therefore not well suited to scene classification. To address these problems, we propose a more robust and effective method for scene recognition that extracts multi-modal features and models their relationships through a cross-modal matching procedure.