1. Introduction
Deep convolutional neural networks [14],[29],[40] have advanced the development of many downstream vision tasks, such as semantic segmentation [1], [3], [21], [25], [49], which is a dense prediction task. To achieve an effective segmentation model, e.g. Deeplab [3], large amounts of pixel-wise image annotations are required for model training, which are costly and time consuming to acquire. Weakly supervised learning [23], [43] can alleviate the annotation costs to some extent; however, the model performance drops significantly under this scenario. Moreover, both fully and weakly supervised models suffer poor generalization to unseen domains with only a few densely-annotated training images available. As such, few-shot semantic segmentation (FSS) [27] is proposed for dealing with the unseen object segmentation and data annotation issue.