I. Introduction
Few-shot learning (FSL) aims to recognize novel categories from only a few labeled examples. A prerequisite for this task is to make full use of the base categories, which contain abundant labeled training samples [2], [3], [4]. Recent FSL work has achieved promising results by obtaining a robust feature extractor (backbone) through careful structural design or model training [2], [3], [5]. These methods train a backbone on the known (base) categories and aim to yield a transferable feature representation (textures and structures) that can describe the novel categories. However, training and validating backbones from scratch is time-consuming and expensive. Moreover, a backbone trained on the base categories tends to focus on the textures of the objects it has learned [4], [6], [7], [8]. As shown in Fig. 1, such a backbone responds to different regions (visualized by Grad-CAM [1]) on samples of different categories. Given a base sample from the “Unicycle” category, the responsive regions concentrate on the body of the cycle, since the backbone has been trained on many unicycle images. On novel samples, however, the responses may drift away from the target object and overlook it. For example, in the image labeled “Retriever”, the backbone concentrates on the baby carriage, which has wheels, rather than on the dog. Many methods have been proposed to enlarge or correct the response regions on novel objects. For example, Liu et al. randomly mix image patches and feed the mixed image to the backbone [5], and Wang et al. split an image into three parts and represent it with three views [4]. However, these methods must be trained from scratch, and the latter additionally requires training backbones that are three times larger.
Fig. 1. Responsive regions of the backbone, visualized by Grad-CAM [1] on several samples from Mini-ImageNet. The backbone is trained on the base categories.
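For readers unfamiliar with Grad-CAM [1], the computation behind such response maps can be sketched as follows. This is a minimal NumPy illustration, not the implementation used to produce the figure: it assumes the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been extracted from the backbone, and the names `grad_cam`, `activations`, and `gradients` are hypothetical.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map for one image and one class.

    activations: (K, H, W) feature maps of a convolutional layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps.
    Both arrays are assumed to be precomputed (hypothetical inputs).
    """
    # Channel weights: global average pooling of the gradients.
    weights = gradients.mean(axis=(1, 2))              # shape (K,)
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # ReLU keeps only regions that positively support the class.
    cam = np.maximum(cam, 0.0)
    # Scale to [0, 1] for visualization, guarding against an all-zero map.
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy input: two 2x2 channels; the second has a negative average gradient.
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 2.0]]])
gradients = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
cam = grad_cam(activations, gradients)
```

In this toy input, the channel whose average gradient is negative is suppressed by the ReLU, so only the location activated by the first channel remains highlighted. This mirrors how a map can concentrate on one region of an image while ignoring another.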