Introduction
Achieving high performance in semantic segmentation relies heavily on pixel-level annotations [1], [2], which can be expensive to obtain in large quantities. Therefore, there is a need to explore methods to achieve good performance with a limited amount of annotated data. Few-shot semantic segmentation has emerged as a solution to this problem [3]–[5]. The goal of few-shot semantic segmentation is to segment unseen classes using only a few support samples and to learn transferable knowledge during the training process.
Prototype networks are known for their interpretability and noise resistance [6], [7]. Existing methods attempt to bridge the differences between support and query images by generating representative prototypes [8]–[10]: prototypes are extracted from backbone features via masked average pooling and then guide the matching of query features. However, prototype learning methods often encounter several challenges: ❶ Excessive compression into prototypes leads to the loss of spatial information. ❷ Significant intra-class representation differences interfere with segmentation accuracy.
To address these issues, some methods generate prototypes from support image features to guide segmentation predictions (e.g., PMM [11], PPNet [12], and ASG [13]). However, intra-class diversity makes it difficult for prototypes generated solely from support images to accurately match query images. Some methods therefore alleviate inaccurate segmentation by incorporating query features into the prototypes (e.g., SSP [14] and DPNet [15]). Although these methods increase the weight of query features in the prototypes, the prototypes still contain a significant proportion of support features. This introduces a new issue: ❸ Invalid prototype features fail to match the distribution of the true class features.
To address these challenges, we propose a novel self-support prototype-aware learning network (SSPA) that effectively utilizes query features to generate query prototypes for guiding feature matching. Unlike previous methods, we exploit the full potential of query features. Our motivation stems from an impressive observation: foreground and background prototypes composed of even a small random subset of query features are remarkably effective. We therefore aim to identify highly confident query foreground and background features. SSPA obtains high-confidence query features by matching the similarity between support and query images. To enhance the contribution of the prototypes, we further optimize the important features within them, decouple the target object from complex semantic information, and supervise the quality of query feature selection. In summary, our work aims to obtain trustworthy and robust perceptual prototypes, addressing issues ❶, ❷, and ❸ for more precise segmentation.
Our main contributions are as follows:
We propose a self-support prototype-aware learning network to mitigate the visual representation discrepancies caused by intra-class diversity.
SSPA obtains highly reliable query features to eliminate the interference of invalid support features.
Our method achieves state-of-the-art performance on multiple datasets, demonstrating its effectiveness.
Methods
A. Motivation
The core motivation of our work is to leverage information from query images to enhance the representation of prototypes, thereby improving segmentation accuracy. We design a Cycle Consistency Collection (CCC) module and a Self-Support Collection (SSC) module to address the interference of invalid support prototypes, providing a new research direction for few-shot semantic segmentation. The detailed structure of the proposed network is shown in Fig. 1, and each component is introduced in detail below.
B. Cycle Consistency Collection
First, we aim to collect high-confidence query features. Specifically, for given image features $\{F^s, F^q\} \in \mathbb{R}^{D \times H \times W}$, we introduce a channel attention mechanism to optimize $F^s$ and $F^q$ so that the selected features are more representative. Since our method relies on prototypes composed of individual features, each feature should retain rich information; we therefore avoid dimensionality-reduction operations and adopt the Efficient Channel Attention (ECA) mechanism [16]. After the ECA operation, $F^s$ and $F^q$ are flattened into 1D sequences $\{C^s, C^q\} \in \mathbb{R}^{D \times HW}$:
\begin{equation*}{C^s},{C^q} = {\text{flatten}}\left\{ {ECA\left( {{F^s},{F^q}} \right)} \right\},\tag{1}\end{equation*}
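To make Eq. (1) concrete, the following minimal PyTorch sketch shows one way to implement the ECA re-weighting and flattening; the module structure, the fixed kernel size of 3, and all variable names are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention [16]: channel re-weighting without
    dimensionality reduction, via a 1D conv over channel descriptors."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W) -> global average pooling gives a (B, D) descriptor
        y = x.mean(dim=(2, 3))
        # 1D conv across the channel axis: (B, 1, D) -> (B, 1, D), then sigmoid
        w = torch.sigmoid(self.conv(y.unsqueeze(1))).squeeze(1)
        return x * w.unsqueeze(-1).unsqueeze(-1)  # re-weight channels

def attend_and_flatten(f_s, f_q, eca: ECA):
    # Eq. (1): apply ECA, then flatten spatial dims to (B, D, HW)
    c_s = eca(f_s).flatten(2)
    c_q = eca(f_q).flatten(2)
    return c_s, c_q
```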
Using the support mask, we obtain the support foreground feature set $C^{s'}$ from $C^s$. For each support foreground feature $C^{s'}_{i'}$, we find the query feature that matches it best:
\begin{align*} & {A_1} = \operatorname{matmul} \left( {{C^{q}}^{T}},{C^{s'}} \right),\tag{2} \\ & {j^{\ast}} = \mathop {\operatorname{argmax} }\limits_{j \in \{ 0,1, \ldots ,HW - 1\} } {A_1}\left( {j,i'} \right),\tag{3}\end{align*}
Symmetrically, the selected query features $C^{q'}$ are matched back to the support features:
\begin{align*} & {A_2} = {\text{matmul}}\left( {{C^s},{{C^{q'}}^T}} \right), \tag{4} \\ & {i^{\ast}} = \mathop {\operatorname{argmax} }\limits_{i \in \{ 0,1, \ldots ,HW - 1\} } {A_2}\left( {i,{j^{\ast}}} \right), \tag{5} \end{align*}
If $i^{\ast}$ falls within the support foreground, the forward and backward matches form a consistent cycle, and the corresponding query feature is collected as a high-confidence query foreground feature.
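The cycle-consistency check of Eqs. (2)–(5) can be sketched as below for a single image pair; the tensor shapes, function name, and use of index sets rather than feature sets are illustrative assumptions.

```python
import torch

def cycle_consistent_query_fg(c_q, c_s, m_s):
    """c_q, c_s: (D, HW) flattened features; m_s: (HW,) binary support mask.
    Returns indices of query features that survive the cycle check."""
    sim = c_q.t() @ c_s                         # (HW_q, HW_s) affinity matrix
    fg_idx = torch.nonzero(m_s > 0).squeeze(1)  # support foreground pixels i'
    j_star = sim[:, fg_idx].argmax(dim=0)       # Eq. (3): best query pixel per i'
    i_star = sim[j_star, :].argmax(dim=1)       # Eq. (5): map each j* back to support
    # Keep j* only if the reverse match lands in the support foreground,
    # i.e., the forward and backward matches form a consistent cycle.
    keep = m_s[i_star] > 0
    return torch.unique(j_star[keep])
```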
C. Self-Support Collection
Next, we use cascaded feature matching and filter query features based on each round's matching results to obtain high-confidence query features. Specifically, the high-confidence query features collected by cycle consistency and by the previous round of feature matching are used to enhance the next round. Taking the first feature matching as an example, we perform the following calculations:
\begin{align*} & P_1^{f} = {\alpha _1}\left( {\frac{1}{{\operatorname{len} \left( {C_0^f} \right)}}\sum\limits_{F \in C_0^f} F } \right) + {\alpha _2}P_s^{f},\quad P_1^{b} = {\alpha _1}\left( {\frac{1}{{\operatorname{len} \left( {C_0^b} \right)}}\sum\limits_{F \in C_0^b} F } \right) + {\alpha _2}P_s^{b},\tag{6} \\ & {A^f} = \cos \left( {P_1^f,{F^q}} \right),\quad {A^b} = \cos \left( {P_1^b,{F^q}} \right),\tag{7} \\ & {M_1} = \operatorname{softmax} \left( {{A^f},{A^b}} \right),\tag{8}\end{align*}
where $C_0^f$ and $C_0^b$ denote the initially collected query foreground and background feature sets, and $P_s^f$, $P_s^b$ are the support foreground and background prototypes.
\begin{equation*} C_1^f,C_1^b = \left\{ {C_i^q|M_1^f(i) > {\tau _1}} \right\},\left\{ {C_j^q|M_1^b(j) > {\tau _2}} \right\}\tag{9}\end{equation*}
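A minimal sketch of one matching round, Eqs. (6)–(9), might look as follows; the tensor shapes and the default values of the blending weights α and thresholds τ are assumed hyperparameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def match_round(f_q, set_f, set_b, p_s_f, p_s_b,
                alpha1=0.5, alpha2=0.5, tau1=0.7, tau2=0.6):
    """f_q: (D, H, W) query feature; set_f/set_b: (N, D) collected query
    features; p_s_f/p_s_b: (D,) support prototypes."""
    # Eq. (6): blend averaged query features with the support prototypes
    p_f = alpha1 * set_f.mean(dim=0) + alpha2 * p_s_f
    p_b = alpha1 * set_b.mean(dim=0) + alpha2 * p_s_b
    # Eq. (7): cosine similarity between each prototype and every query pixel
    fq = f_q.flatten(1)                                   # (D, HW)
    a_f = F.cosine_similarity(p_f[:, None], fq, dim=0)    # (HW,)
    a_b = F.cosine_similarity(p_b[:, None], fq, dim=0)
    # Eq. (8): two-way softmax over the fg/bg similarity maps
    m = torch.softmax(torch.stack([a_f, a_b]), dim=0)     # (2, HW)
    # Eq. (9): collect query features whose scores clear the thresholds
    c_f = fq.t()[m[0] > tau1]                             # (N_f, D)
    c_b = fq.t()[m[1] > tau2]
    return m, c_f, c_b
```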
To complement the global prototype with local information, we use the Superpixel-guided Clustering (SGC) method [13] to generate multiple support foreground prototypes $\{P_{sl}^f\}_{l=0}^{k}$.
We propose a pixel-level important-prototype dilation scheme to assign the multiple support foreground prototypes at the pixel level. First, each support prototype $P_{sl}^f$ is compared with the query feature, and every query pixel keeps its maximum similarity across all prototypes:
\begin{align*} & {A^{sfl}} = \cos \left( {{F^q},P_{sl}^f} \right),\tag{10} \\ & A_{ij}^{sf} = \mathop {\max }\limits_{l \in \{ 0,1, \ldots ,k\} } A_{ij}^{sfl},\tag{11}\end{align*}
Fig. 1. Overall architecture of the proposed SSPA. CCC combines the channel attention mechanism and cycle consistency to select high-confidence query foreground features. SSC uses cascaded feature selection to obtain higher-quality query features and enables independent decision-making for query features. Finally, PCI uses contrastive learning to achieve object decoupling and feature inspection.
Finally, $A^f$ and $A^{sf}$ are weighted-averaged to obtain the final similarity matrix $A$, which combines local and global information. $A$ then replaces $A^f$ in Eq. (8) to refine the initial feature matching.
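Continuing the conventions of the previous sketch, Eqs. (10)–(11) and the final fusion can be written as below; the fusion weight `w` is an assumed hyperparameter.

```python
import torch.nn.functional as F

def multi_prototype_similarity(f_q, protos, a_f, w=0.5):
    """f_q: (D, H, W) query feature; protos: (k+1, D) SGC support
    prototypes; a_f: (HW,) global similarity map from Eq. (7)."""
    fq = f_q.flatten(1)                                               # (D, HW)
    # Eq. (10): one cosine similarity map per support prototype
    a_sfl = F.cosine_similarity(protos[:, :, None], fq[None], dim=1)  # (k+1, HW)
    # Eq. (11): each query pixel keeps its best-matching prototype
    a_sf = a_sfl.max(dim=0).values                                    # (HW,)
    # Weighted average of global A^f and local A^sf; the result
    # replaces A^f in Eq. (8).
    return w * a_f + (1 - w) * a_sf
```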
D. Prototype Contrastive Inspection
The erroneous coupling of background regions with the target object is a common issue. We therefore introduce contrastive learning to alleviate this problem while inspecting prototype quality. Specifically, we treat the support foreground prototype $P_s^f$, the initial query foreground prototype $P_0^f$, and the collected query foreground prototype $P_t^f$ as positive samples for the final query foreground prototype $P_Q^f$, and the query background prototype $P_t^b$ as the negative sample:
\begin{align*} & {d_{{\text{pos}}}} = {e^{\cos \left( {P_Q^f,P_s^f} \right)}} + {e^{\cos \left( {P_Q^f,P_0^f} \right)}} + {e^{\cos \left( {P_Q^f,P_t^f} \right)}},\tag{12} \\ & {L_{pci}} = - \log \frac{{{d_{pos}}}}{{{d_{pos}} + {e^{\cos \left( {P_Q^f,P_t^b} \right)}}}},\tag{13}\end{align*}
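Eqs. (12)–(13) translate almost directly into code; this sketch assumes each prototype is a D-dimensional vector and takes the three positives as a list.

```python
import torch
import torch.nn.functional as F

def pci_loss(p_q_f, positives, p_t_b):
    """p_q_f: (D,) query foreground prototype; positives: list of (D,)
    prototypes [P_s^f, P_0^f, P_t^f]; p_t_b: (D,) background prototype."""
    # Eq. (12): sum of exponentiated similarities to the positive prototypes
    d_pos = sum(torch.exp(F.cosine_similarity(p_q_f, p, dim=0)) for p in positives)
    # Eq. (13): contrastive loss pushing P_Q^f away from the background
    d_neg = torch.exp(F.cosine_similarity(p_q_f, p_t_b, dim=0))
    return -torch.log(d_pos / (d_pos + d_neg))
```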
E. Training Loss
Considering the trade-off between model performance and efficiency, we set the number of feature-matching iterations to three and use $M_3$ as the final prediction result. During the training phase, we supervise the last two feature-matching results as follows:
\begin{equation*}{L_m} = BCE\left( {{M_2},M_q^f} \right) + BCE\left( {{M_3},M_q^f} \right),\tag{14}\end{equation*}
where $M_q^f$ denotes the ground-truth query foreground mask. The total loss is
\begin{equation*}L = {\lambda _1}{L_m} + {\lambda _2}{L_{pci}},\tag{15}\end{equation*}
where $\lambda_1$ and $\lambda_2$ are weights balancing the two terms.
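The overall objective of Eqs. (14)–(15) can be sketched as follows, assuming the predicted masks are foreground probability maps in [0, 1] and that λ1 and λ2 are scalar weights.

```python
import torch.nn.functional as F

def total_loss(m2, m3, m_gt, l_pci, lambda1=1.0, lambda2=1.0):
    """m2, m3: predicted foreground probability maps; m_gt: binary query
    ground-truth mask; l_pci: output of pci_loss above."""
    # Eq. (14): BCE supervision on the last two matching rounds
    l_m = F.binary_cross_entropy(m2, m_gt) + F.binary_cross_entropy(m3, m_gt)
    # Eq. (15): weighted sum with the prototype contrastive term
    return lambda1 * l_m + lambda2 * l_pci
```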
Experiments
A. Dataset and Evaluation Metrics
We evaluate our model on two datasets, PASCAL-5i [3] and COCO-20i [18], which are widely used in prior few-shot semantic segmentation (FSS) work. Following previous studies [19], we divide the object categories of each dataset into four folds and conduct experiments via cross-validation. The mean intersection over union (mIoU) is the primary evaluation metric for all experiments.
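For reference, a minimal sketch of the mIoU computation under this protocol, assuming per-class intersection and union pixel counts accumulated over all test episodes of a fold; the accumulation itself is omitted.

```python
import numpy as np

def mean_iou(inter: np.ndarray, union: np.ndarray) -> float:
    """inter, union: (num_classes,) arrays of accumulated pixel counts."""
    valid = union > 0                        # ignore classes absent from the fold
    return float((inter[valid] / union[valid]).mean())
```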
B. Implementation Details
We adopt the popular pre-trained ResNet-101 [20] as the backbone. We discard the last block of the backbone and freeze the first two blocks. The initial learning rate is set to 1e-3, and we use the SGD optimizer with a momentum of 0.9 to update the parameters. Both support and query images are cropped to 400 × 400. We train the model for 20 epochs on both datasets with a batch size of 8.
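For illustration, this setup could be wired as below with torchvision; interpreting the "first two blocks" as layer1 and layer2, and the exact parameter filtering, are our assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet101(weights="IMAGENET1K_V1")
# Discard the last block (layer4) and keep the rest as a feature extractor.
backbone = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)
# Freeze the first two residual blocks (here taken as layer1 and layer2).
for block in (resnet.layer1, resnet.layer2):
    for p in block.parameters():
        p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in backbone.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9,
)
# Support/query images cropped to 400x400; 20 epochs, batch size 8.
```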
C. Comparison with State-of-the-Art (SOTA)
Quantitative results. We compare our performance with advanced few-shot semantic segmentation methods from recent years. As shown in Tables I and II, we report the best performance of these methods on the two mainstream datasets, PASCAL-5i and COCO-20i, and mark the backbone used to obtain it. We build our baseline on the SSP [14] method. As shown in Table I, our method significantly improves over the baseline on PASCAL-5i: we achieve 5.1% and 3.3% mIoU gains for 1-shot and 5-shot, respectively, reaching the SOTA in both cases. Compared to PASCAL-5i, COCO-20i has more complex segmentation scenarios and is highly challenging. As shown in Table II, our method also achieves remarkable results on this dataset: compared to the baseline, we obtain 6.7% and 4.9% mIoU gains for 1-shot and 5-shot, respectively, again reaching the SOTA in both cases. It is worth noting that, in addition to performance advantages, our method also offers significant advantages in training efficiency. Whereas most methods [26], [30], [31] require 200 epochs of training, we need only 20 epochs to achieve remarkable performance, which confirms the effectiveness of our prototype-aware network.
Qualitative Results. Our method effectively overcomes the intra-class diversity problem. As shown in Fig. 2, even when there are marked appearance disparities between the support and query images, such as in size, color, and shape, our method still segments the target well.
D. Ablation Study
We conduct ablation studies to investigate the impact of each component on segmentation performance and the performance variations under different settings. All experiments report 1-shot segmentation performance on the PASCAL-5i dataset with the ResNet-101 backbone.
Effectiveness of SSPA Components. The three components of SSPA, namely Cycle Consistency Collection (CCC), Self-Support Collection (SSC), and Prototype Contrastive Inspection (PCI), are crucial. We sequentially examine their effectiveness in Table III. CCC, which combines channel attention and cycle consistency, filters out high-quality query foreground features to guide feature matching, yielding a 2.1% performance improvement. SSC optimizes the initial segmentation prediction, making subsequent feature selection more reliable; it also selects query features with strong semantic signals based on the cascaded predictions to construct high-quality query prototypes for fine-grained segmentation, leading to a 1.6% improvement. PCI decouples the target objects and verifies the feature selection results to better supervise the selection quality of CCC and SSC, achieving a 1.4% improvement through contrastive learning. These three modules cooperate to generate high-quality query prototypes, greatly improve the prototype-awareness of the network, and effectively alleviate the intra-class diversity problem. These significant improvements verify the effectiveness of the proposed modules.
Fig. 2. Qualitative results of the proposed SSPA and the baseline approach. From top to bottom: support images, query images, predictions of the baseline, and predictions of SSPA.
Conclusion and Discussion
In this paper, we proposed a novel self-support prototype-aware learning network (SSPA) for few-shot semantic segmentation. We addressed the challenges of prototype-based methods caused by intra-class diversity and the limitations of support prototypes in accurately matching query foreground. We believe that our work opens up new possibilities for advancing the field of few-shot semantic segmentation and encourages further research in this area.
Limitation and Future Work. Although SSPA outperforms existing methods on various datasets, it still struggles to segment fine details and object edges accurately, which calls for more targeted approaches to representing them. Future work will be dedicated to improving the model in these challenging scenarios and making it a more robust and generalizable visual perception model.