I. Introduction
Scene recognition is a challenging task in computer vision [1], [2] and has attracted significant research attention because of its potential in both indoor and outdoor applications, such as robotics and autonomous driving. Recognizing a scene from a single image is difficult, particularly on large datasets [3], [4], because of large intraclass variations, high interclass similarities, scale variations, and viewpoint differences. For example, Fig. 1 shows that images in the library-outdoor and general_store-outdoor categories can be easily confused, and similar ambiguities arise between computer_room and office_cubicles. The fact that scene images not only contain various local objects but also have meaningful global spatial properties makes the task even harder.
Fig. 1. Example images from the Places2 dataset, covering both outdoor and indoor scene categories. (a) Library-outdoor. (b) General_store-outdoor. (c) Computer_room. (d) Office_cubicles.