Introduction
Achieving high performance in semantic segmentation relies heavily on pixel-level annotations [1], [2], which can be expensive to obtain in large quantities. Therefore, there is a need to explore methods to achieve good performance with a limited amount of annotated data. Few-shot semantic segmentation has emerged as a solution to this problem [3]–[5]. The goal of few-shot semantic segmentation is to segment unseen classes using only a few support samples and to learn transferable knowledge during the training process.
Prototype networks are known for their interpretability and noise resistance [6], [7]. Existing methods attempt to bridge the differences between support and query images by generating representative prototypes [8]–[10]: prototypes are extracted from backbone features via masked average pooling and then guide the matching of query features. However, prototype learning methods often encounter several challenges: ❶ Excessive compression into prototypes leads to the loss of spatial information. ❷ Significant intra-class representation differences interfere with segmentation accuracy.
To address these issues, some methods generate prototypes from support image features to guide segmentation predictions (e.g., PMM [11], PPNet [12], and ASG [13]). However, intra-class diversity makes it difficult for prototypes generated solely from support images to accurately match query images. Some methods therefore alleviate inaccurate segmentation by incorporating query features into the prototypes (e.g., SSP [14] and DPNet [15]). Although these methods increase the weight of query features in the prototypes, the prototypes still contain a significant proportion of support features. This introduces a new issue: ❸ Invalid prototype features fail to match the distribution of the true class features.
To address these challenges, we propose a novel self-support prototype-aware learning network (SSPA) that effectively utilizes query features to generate query prototypes for guiding feature matching. Unlike previous methods, we exploit the full potential of query features. Our motivation stems from an impressive observation: foreground and background prototypes composed of even a small random subset of query features are remarkably effective. We therefore aim to identify highly confident query foreground and background features. SSPA obtains high-confidence query features by matching the similarity between support and query images. To enhance the contribution of the prototypes, we further optimize the important features within them, decouple the target object from complex semantic information, and supervise the quality of query feature selection. In summary, our work aims to obtain trustworthy and robust perceptual prototypes, addressing issues ❶, ❷, and ❸ for more precise segmentation.
Our main contributions are as follows:
We propose a self-support prototype-aware learning network to mitigate the visual representation discrepancies caused by intra-class diversity.
SSPA obtains highly reliable query features to eliminate the interference of invalid support features.
Our method achieves state-of-the-art performance on multiple datasets, demonstrating its effectiveness.
Methods
A. Motivation
The core motivation of our work is to leverage information from query images to enhance the representation of prototypes, thereby improving segmentation accuracy. We design a Cycle Consistency Collection (CCC) module and a Self-Support Collection (SSC) module to address the interference of invalid support prototypes, providing a new research direction for few-shot semantic segmentation. The detailed structure of the proposed network is shown in Fig. 1, and each component is introduced in detail below.
B. Cycle Consistency Collection
First, we aim to collect high-confidence query features. Specifically, for given image features $\{F^s, F^q\} \in \mathbb{R}^{D \times H \times W}$, we introduce a channel attention mechanism to optimize $F^s$ and $F^q$ so that the selected features are more representative. Since our method relies on prototypes composed of individual features, each feature should retain rich information; we therefore avoid dimensionality-reduction operations and adopt the Efficient Channel Attention (ECA) mechanism [16]. After the ECA operation, $F^s$ and $F^q$ are flattened into 1D sequences $\{C^s, C^q\} \in \mathbb{R}^{D \times HW}$:
\begin{equation*}{C^s},{C^q} = {\text{flatten}}\left\{ {ECA\left( {{F^s},{F^q}} \right)} \right\},\tag{1}\end{equation*}
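To make Eq. (1) concrete, the following minimal PyTorch sketch shows one way to implement the ECA re-weighting and flattening; the module structure, the fixed kernel size of 3, and all variable names are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention [16]: channel re-weighting without
    dimensionality reduction, via a 1D conv over channel descriptors."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W) -> global average pooling gives a (B, D) descriptor
        y = x.mean(dim=(2, 3))
        # 1D conv across the channel axis: (B, 1, D) -> (B, 1, D), then sigmoid
        w = torch.sigmoid(self.conv(y.unsqueeze(1))).squeeze(1)
        return x * w.unsqueeze(-1).unsqueeze(-1)  # re-weight channels

def attend_and_flatten(f_s, f_q, eca: ECA):
    # Eq. (1): apply ECA, then flatten spatial dims to (B, D, HW)
    c_s = eca(f_s).flatten(2)
    c_q = eca(f_q).flatten(2)
    return c_s, c_q
```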
Using the support mask, we obtain the support foreground feature set $C^{s'}$ from $C^s$. For each support foreground feature $C^{s'}_{i'}$, we find the query feature that matches it best:
\begin{align*} & {A_1} = \operatorname{matmul} \left( {{C^{q}}^{T}},{C^{s'}} \right),\tag{2} \\ & {j^{\ast}} = \mathop {\operatorname{argmax} }\limits_{j \in \{ 0,1, \ldots ,HW - 1\} } {A_1}\left( {j,i'} \right),\tag{3}\end{align*}
Symmetrically, the selected query features $C^{q'}$ are matched back to the support features:
\begin{align*} & {A_2} = {\text{matmul}}\left( {{C^s},{{C^{q'}}^T}} \right), \tag{4} \\ & {i^{\ast}} = \mathop {\operatorname{argmax} }\limits_{i \in \{ 0,1, \ldots ,HW - 1\} } {A_2}\left( {i,{j^{\ast}}} \right), \tag{5} \end{align*}
If $i^{\ast}$ falls within the support foreground, the forward and backward matches form a consistent cycle, and the corresponding query feature is collected as a high-confidence query foreground feature.
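The cycle-consistency check of Eqs. (2)–(5) can be sketched as below for a single image pair; the tensor shapes, function name, and use of index sets rather than feature sets are illustrative assumptions.

```python
import torch

def cycle_consistent_query_fg(c_q, c_s, m_s):
    """c_q, c_s: (D, HW) flattened features; m_s: (HW,) binary support mask.
    Returns indices of query features that survive the cycle check."""
    sim = c_q.t() @ c_s                         # (HW_q, HW_s) affinity matrix
    fg_idx = torch.nonzero(m_s > 0).squeeze(1)  # support foreground pixels i'
    j_star = sim[:, fg_idx].argmax(dim=0)       # Eq. (3): best query pixel per i'
    i_star = sim[j_star, :].argmax(dim=1)       # Eq. (5): map each j* back to support
    # Keep j* only if the reverse match lands in the support foreground,
    # i.e., the forward and backward matches form a consistent cycle.
    keep = m_s[i_star] > 0
    return torch.unique(j_star[keep])
```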
C. Self-Support Collection
Next, we use cascaded feature matching and filter query features based on each round's matching results to obtain high-confidence query features. Specifically, the high-confidence query features collected by cycle consistency and by the previous round of feature matching are used to enhance the next round. Taking the first feature matching as an example, we perform the following calculations:
\begin{align*} & P_1^{f} = {\alpha _1}\left( {\frac{1}{{\operatorname{len} \left( {C_0^f} \right)}}\sum\limits_{F \in C_0^f} F } \right) + {\alpha _2}P_s^{f},\quad P_1^{b} = {\alpha _1}\left( {\frac{1}{{\operatorname{len} \left( {C_0^b} \right)}}\sum\limits_{F \in C_0^b} F } \right) + {\alpha _2}P_s^{b},\tag{6} \\ & {A^f} = \cos \left( {P_1^f,{F^q}} \right),\quad {A^b} = \cos \left( {P_1^b,{F^q}} \right),\tag{7} \\ & {M_1} = \operatorname{softmax} \left( {{A^f},{A^b}} \right),\tag{8}\end{align*}
where $C_0^f$ and $C_0^b$ denote the initially collected query foreground and background feature sets, and $P_s^f$, $P_s^b$ are the support foreground and background prototypes.
\begin{equation*} C_1^f,C_1^b = \left\{ {C_i^q|M_1^f(i) > {\tau _1}} \right\},\left\{ {C_j^q|M_1^b(j) > {\tau _2}} \right\}\tag{9}\end{equation*}
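A minimal sketch of one matching round, Eqs. (6)–(9), might look as follows; the tensor shapes and the default values of the blending weights α and thresholds τ are assumed hyperparameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def match_round(f_q, set_f, set_b, p_s_f, p_s_b,
                alpha1=0.5, alpha2=0.5, tau1=0.7, tau2=0.6):
    """f_q: (D, H, W) query feature; set_f/set_b: (N, D) collected query
    features; p_s_f/p_s_b: (D,) support prototypes."""
    # Eq. (6): blend averaged query features with the support prototypes
    p_f = alpha1 * set_f.mean(dim=0) + alpha2 * p_s_f
    p_b = alpha1 * set_b.mean(dim=0) + alpha2 * p_s_b
    # Eq. (7): cosine similarity between each prototype and every query pixel
    fq = f_q.flatten(1)                                   # (D, HW)
    a_f = F.cosine_similarity(p_f[:, None], fq, dim=0)    # (HW,)
    a_b = F.cosine_similarity(p_b[:, None], fq, dim=0)
    # Eq. (8): two-way softmax over the fg/bg similarity maps
    m = torch.softmax(torch.stack([a_f, a_b]), dim=0)     # (2, HW)
    # Eq. (9): collect query features whose scores clear the thresholds
    c_f = fq.t()[m[0] > tau1]                             # (N_f, D)
    c_b = fq.t()[m[1] > tau2]
    return m, c_f, c_b
```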
To complement the global prototype with local information, we use the Superpixel-guided Clustering (SGC) method [13] to generate multiple support foreground prototypes $\{P_{sl}^f\}_{l=0}^{k}$.
We propose a pixel-level important-prototype dilation scheme to assign the multiple support foreground prototypes at the pixel level. First, each support prototype $P_{sl}^f$ is compared with the query feature, and every query pixel keeps its maximum similarity across all prototypes:
\begin{align*} & {A^{sfl}} = \cos \left( {{F^q},P_{sl}^f} \right),\tag{10} \\ & A_{ij}^{sf} = \mathop {\max }\limits_{l \in \{ 0,1, \ldots ,k\} } A_{ij}^{sfl},\tag{11}\end{align*}
Fig. 1. Overall architecture of the proposed SSPA. CCC combines the channel attention mechanism and cycle consistency to select high-confidence query foreground features. SSC uses cascaded feature selection to obtain higher-quality query features and enables independent decision-making for query features. Finally, PCI uses contrastive learning to achieve object decoupling and feature inspection.
Finally, $A^f$ and $A^{sf}$ are weighted-averaged to obtain the final similarity matrix $A$, which combines local and global information. $A$ then replaces $A^f$ in Eq. (8) to refine the initial feature matching.
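Continuing the conventions of the previous sketch, Eqs. (10)–(11) and the final fusion can be written as below; the fusion weight `w` is an assumed hyperparameter.

```python
import torch.nn.functional as F

def multi_prototype_similarity(f_q, protos, a_f, w=0.5):
    """f_q: (D, H, W) query feature; protos: (k+1, D) SGC support
    prototypes; a_f: (HW,) global similarity map from Eq. (7)."""
    fq = f_q.flatten(1)                                               # (D, HW)
    # Eq. (10): one cosine similarity map per support prototype
    a_sfl = F.cosine_similarity(protos[:, :, None], fq[None], dim=1)  # (k+1, HW)
    # Eq. (11): each query pixel keeps its best-matching prototype
    a_sf = a_sfl.max(dim=0).values                                    # (HW,)
    # Weighted average of global A^f and local A^sf; the result
    # replaces A^f in Eq. (8).
    return w * a_f + (1 - w) * a_sf
```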
D. Prototype Contrastive Inspection
The erroneous coupling of background regions with the target object is a common issue. We therefore introduce contrastive learning to alleviate this problem while inspecting prototype quality. Specifically, we treat the support foreground prototype $P_s^f$, the initial query foreground prototype $P_0^f$, and the collected query foreground prototype $P_t^f$ as positive samples for the final query foreground prototype $P_Q^f$, and the query background prototype $P_t^b$ as the negative sample:
\begin{align*} & {d_{{\text{pos}}}} = {e^{\cos \left( {P_Q^f,P_s^f} \right)}} + {e^{\cos \left( {P_Q^f,P_0^f} \right)}} + {e^{\cos \left( {P_Q^f,P_t^f} \right)}},\tag{12} \\ & {L_{pci}} = - \log \frac{{{d_{pos}}}}{{{d_{pos}} + {e^{\cos \left( {P_Q^f,P_t^b} \right)}}}},\tag{13}\end{align*}
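Eqs. (12)–(13) translate almost directly into code; this sketch assumes each prototype is a D-dimensional vector and takes the three positives as a list.

```python
import torch
import torch.nn.functional as F

def pci_loss(p_q_f, positives, p_t_b):
    """p_q_f: (D,) query foreground prototype; positives: list of (D,)
    prototypes [P_s^f, P_0^f, P_t^f]; p_t_b: (D,) background prototype."""
    # Eq. (12): sum of exponentiated similarities to the positive prototypes
    d_pos = sum(torch.exp(F.cosine_similarity(p_q_f, p, dim=0)) for p in positives)
    # Eq. (13): contrastive loss pushing P_Q^f away from the background
    d_neg = torch.exp(F.cosine_similarity(p_q_f, p_t_b, dim=0))
    return -torch.log(d_pos / (d_pos + d_neg))
```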
E. Training Loss
Considering the trade-off between model performance and efficiency, we set the number of feature-matching iterations to three and use $M_3$ as the final prediction result. During the training phase, we supervise the last two feature-matching results as follows:
\begin{equation*}{L_m} = BCE\left( {{M_2},M_q^f} \right) + BCE\left( {{M_3},M_q^f} \right),\tag{14}\end{equation*}
where $M_q^f$ denotes the ground-truth query foreground mask. The total loss is
\begin{equation*}L = {\lambda _1}{L_m} + {\lambda _2}{L_{pci}},\tag{15}\end{equation*}
where $\lambda_1$ and $\lambda_2$ are weights balancing the two terms.
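The overall objective of Eqs. (14)–(15) can be sketched as follows, assuming the predicted masks are foreground probability maps in [0, 1] and that λ1 and λ2 are scalar weights.

```python
import torch.nn.functional as F

def total_loss(m2, m3, m_gt, l_pci, lambda1=1.0, lambda2=1.0):
    """m2, m3: predicted foreground probability maps; m_gt: binary query
    ground-truth mask; l_pci: output of pci_loss above."""
    # Eq. (14): BCE supervision on the last two matching rounds
    l_m = F.binary_cross_entropy(m2, m_gt) + F.binary_cross_entropy(m3, m_gt)
    # Eq. (15): weighted sum with the prototype contrastive term
    return lambda1 * l_m + lambda2 * l_pci
```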
Experiments
A. Dataset and Evaluation Metrics
We evaluate our model on two datasets, PASCAL-5i [3] and COCO-20i [18], which are widely used in prior few-shot semantic segmentation (FSS) work. Following previous studies [19], we divide the object categories of each dataset into four folds and conduct experiments via cross-validation. The mean intersection over union (mIoU) is the primary evaluation metric for all experiments.
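For reference, a minimal sketch of the mIoU computation under this protocol, assuming per-class intersection and union pixel counts accumulated over all test episodes of a fold; the accumulation itself is omitted.

```python
import numpy as np

def mean_iou(inter: np.ndarray, union: np.ndarray) -> float:
    """inter, union: (num_classes,) arrays of accumulated pixel counts."""
    valid = union > 0                        # ignore classes absent from the fold
    return float((inter[valid] / union[valid]).mean())
```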
B. Implementation Details
We adopt the popular pre-trained ResNet-101 [20] as the backbone. We discard the last block of the backbone and freeze the first two blocks. The initial learning rate is set to 1e-3, and we use the SGD optimizer with a momentum of 0.9 to update the parameters. Both support and query images are cropped to 400 × 400. We train the model for 20 epochs on both datasets with a batch size of 8.
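For illustration, this setup could be wired as below with torchvision; interpreting the "first two blocks" as layer1 and layer2, and the exact parameter filtering, are our assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet101(weights="IMAGENET1K_V1")
# Discard the last block (layer4) and keep the rest as a feature extractor.
backbone = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)
# Freeze the first two residual blocks (here taken as layer1 and layer2).
for block in (resnet.layer1, resnet.layer2):
    for p in block.parameters():
        p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in backbone.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9,
)
# Support/query images cropped to 400x400; 20 epochs, batch size 8.
```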
C. Comparison with State-of-the-Art (SOTA)
Quantitative results. We compare our performance with advanced few-shot semantic segmentation methods from recent years. As shown in Tables I and II, we report the best performance of these methods on the two mainstream datasets, PASCAL-5i and COCO-20i, and mark the backbone used to obtain it. We build our baseline on the SSP [14] method. As shown in Table I, our method significantly improves over the baseline on PASCAL-5i: we achieve 5.1% and 3.3% mIoU gains for 1-shot and 5-shot, respectively, reaching the SOTA in both cases. Compared to PASCAL-5i, COCO-20i has more complex segmentation scenarios and is highly challenging. As shown in Table II, our method also achieves remarkable results on this dataset: compared to the baseline, we obtain 6.7% and 4.9% mIoU gains for 1-shot and 5-shot, respectively, again reaching the SOTA in both cases. It is worth noting that, in addition to performance advantages, our method also offers significant advantages in training efficiency. Whereas most methods [26], [30], [31] require 200 epochs of training, we need only 20 epochs to achieve remarkable performance, which confirms the effectiveness of our prototype-aware network.
Qualitative Results. Our method effectively overcomes the intra-class diversity problem. As shown in Fig. 2, even when there are marked appearance disparities between the support and query images, such as in size, color, and shape, our method still segments the target well.
D. Ablation Study
We conduct ablation studies to investigate the impact of each component on segmentation performance and the performance variations under different settings. All experiments report 1-shot segmentation performance on the PASCAL-5i dataset with the ResNet-101 backbone.
Effectiveness of SSPA Components. The three components of SSPA, namely Cycle Consistency Collection (CCC), Self-Support Collection (SSC), and Prototype Contrastive Inspection (PCI), are crucial. We sequentially examine their effectiveness in Table III. CCC, which combines channel attention and cycle consistency, filters out high-quality query foreground features to guide feature matching, yielding a 2.1% performance improvement. SSC optimizes the initial segmentation prediction, making subsequent feature selection more reliable; it also selects query features with strong semantic signals based on the cascaded predictions to construct high-quality query prototypes for fine-grained segmentation, leading to a 1.6% improvement. PCI decouples the target objects and verifies the feature selection results to better supervise the selection quality of CCC and SSC, achieving a 1.4% improvement through contrastive learning. These three modules cooperate to generate high-quality query prototypes, greatly improve the prototype-awareness of the network, and effectively alleviate the intra-class diversity problem. These significant improvements verify the effectiveness of the proposed modules.
Fig. 2. Qualitative results of the proposed SSPA and the baseline approach. From top to bottom: support images, query images, predictions of the baseline, and predictions of SSPA.
Conclusion and Discussion
In this paper, we proposed a novel self-support prototype-aware learning network (SSPA) for few-shot semantic segmentation. We addressed the challenges of prototype-based methods caused by intra-class diversity and the limitations of support prototypes in accurately matching query foreground. We believe that our work opens up new possibilities for advancing the field of few-shot semantic segmentation and encourages further research in this area.
Limitation and Future Work. Although SSPA outperforms existing methods on various datasets, it still struggles to segment fine details and object edges accurately, which calls for more targeted approaches to representing them. Future work will be dedicated to improving the model in these challenging scenarios and making it a more robust and generalizable visual perception model.