Introduction
Weakly supervised semantic segmentation (WSSS) from image-level labels alone [1] is one of the most challenging variants of WSSS, and it is the setting our work addresses.
The standard procedure consists of three stages: a classification model is first trained to produce Class Activation Map (CAM) seeds; these CAMs are then refined to generate pseudo-labels; finally, a segmentation model is trained on the pseudo-labels in a fully supervised fashion. Our research focuses on refining the CAM generation process so that the maps cover entire objects more completely, corresponding to the second stage of this pipeline. Recent work, including [2]–[4], seeks to improve the segmentation model's accuracy by incorporating contextual knowledge. Furthermore, some studies, motivated by advances in representation learning
[5], [6], advocate combining semantic context with instance-specific knowledge derived from global-scale modeling in order to refine instance-level semantic features, as demonstrated in [7] and [8]. However, these studies do not sufficiently address the large intra-class variation present in real-world scenarios. The traditional CAM and its derivatives often fail to cover objects completely, because they are produced by discriminative models that inherently disregard regions with high class co-occurrence [8], [9]; this issue is particularly pronounced when training with a limited number of classes. Prototype learning (PL) [10], [11] posits that prototypes can encapsulate different aspects of object representation, from local to global features or even object attributes, which holds promise for improving the discrimination and generalization ability of WSSS models.
Inspired by PL, we propose a novel GMM-BPA strategy that not only alleviates the knowledge bias arising from the GMM-modeled feature attributes of a specific category but also mines spurious features from confusing contexts so that they can be suppressed. Specifically, we hypothesize that the features extracted from the penultimate layer of a classification model follow a Gaussian Mixture Model (GMM). Prototypes are then derived from these GMM-distributed features for each category and its corresponding background using the EM algorithm. To obtain a comprehensive perspective, we incorporate categorical support banks, which capture intra-class feature diversity beyond the constraints imposed by mini-batch training. Finally, we introduce a bootstrap prototype-aware CAM that serves a dual purpose: it captures elusive local features of objects while suppressing indiscriminate features from confusing classes. Recognizing that a limited set of instance features can bias the feature distribution, we further adopt a weighted feature distribution alignment that encourages features to migrate towards a globally aggregated categorical memory bank.
Finally, we conduct extensive experimental validation on the benchmark datasets PASCAL VOC [12] and MS COCO [13]. The results not only substantiate the effectiveness of our approach but also show that it achieves state-of-the-art performance in weakly supervised semantic segmentation. Overall, the contributions of this work include the following aspects:
We propose a Gaussian Mixture-based prototype learning strategy that alleviates the knowledge bias between instances and their context.
We propose a bootstrap prototype-aware CAM that generates better pseudo-labels.
We achieve a significant improvement and state-of-the-art performance on two benchmarks.
Methodology
A. Preliminaries
We first briefly revisit the GMM, which is defined as a weighted sum of K Gaussian distributions, each with its own mean µk and covariance matrix Σk: \begin{equation*} p(x) = \sum\limits_{k = 1}^{K} {\pi _k}\,\mathcal{N}\left({x\mid{\mu _k},{\Sigma _k}}\right), \end{equation*} where the mixing coefficients πk sum to one. The parameters are estimated with the Expectation-Maximization (EM) algorithm as follows.
Step 1: Initialization. Initialize the means µk, covariance matrices Σk, and mixing coefficients πk.
Step 2: Expectation (E-step). Compute the responsibility γ(zik) of each data point xi for each component k as: \begin{equation*} \gamma \left({{z_{ik}}}\right) = \frac{{{\pi _k}\,\mathcal{N}\left({{x_i}\mid{\mu _k},{\Sigma _k}}\right)}}{{\sum\nolimits_{j = 1}^K {{\pi _j}\,\mathcal{N}\left({{x_i}\mid{\mu _j},{\Sigma _j}}\right)} }}. \end{equation*}
Step 3: Maximization (M-step). The M-step updates the parameters of the GMM to maximize the expected log-likelihood from the E-step. The mixing coefficient πk, mean µk, and covariance matrix Σk are updated as \begin{align*} & {\pi _k} = \frac{1}{N}\sum\limits_{i = 1}^N \gamma \left({{z_{ik}}}\right),\quad {\mu _k} = \frac{{\sum\nolimits_{i = 1}^N \gamma \left({{z_{ik}}}\right){x_i}}}{{\sum\nolimits_{i = 1}^N \gamma \left({{z_{ik}}}\right)}},\tag{1} \\ & {\Sigma _k} = \frac{{\sum\nolimits_{i = 1}^N \gamma \left({{z_{ik}}}\right)\left({{x_i} - {\mu _k}}\right){{\left({{x_i} - {\mu _k}}\right)}^T}}}{{\sum\nolimits_{i = 1}^N \gamma \left({{z_{ik}}}\right)}}.\tag{2}\end{align*}
Step 4: Iteration. Repeat the E-step and M-step until convergence, which is typically measured by the change in the log-likelihood.
Step 5: Model assessment. Once the algorithm converges, the final parameters characterize the GMM.
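To make the procedure above concrete, the following is a minimal NumPy sketch of EM for a GMM; the data matrix X, the component count K, and the convergence tolerance are illustrative placeholders rather than settings from this paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iters=100, tol=1e-4, seed=0):
    """Fit a K-component GMM to the rows of X (N x D) with EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initialize means, covariances, and mixing coefficients.
    mu = X[rng.choice(N, K, replace=False)]
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # Step 2 (E-step): responsibilities gamma(z_ik).
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], cov[k])
                         for k in range(K)], axis=1)          # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # Step 3 (M-step): update pi_k, mu_k, Sigma_k (Eqs. (1)-(2)).
        Nk = gamma.sum(axis=0)                                 # (K,)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        # Step 4: convergence check via the change in log-likelihood.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    # Step 5: the converged parameters characterize the GMM.
    return pi, mu, cov
```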
B. Framework Overview
In this work, we introduce a GMM-based bootstrap prototype-aware strategy tailored for WSSS with image-level annotations. Our framework follows the conventional WSSS pipeline and involves several key steps. First, a classification model is trained to recognize each object's category. Next, the features extracted by the backbone network are modeled with a GMM to generate prototypes for the category-specific and background support banks. We then construct a GMM bootstrap prototype-aware Class Activation Map (CAM), which guides the generation of labels for a mask refinement network. Finally, the pseudo-labels produced by mask refinement are used to train a segmentation model in a fully supervised manner. This systematic strategy integrates weak supervision for effective semantic segmentation.
We denote a set of input images x ∈ R^{3×H×W} and the corresponding multi-hot classification labels y ∈ {0, 1}^N, where N is the total number of categories. The training dataset is formalized as D = {(x, y)}.
We denote the output feature map of the trained encoder as f(x). Given the classifier weight w_n for category n, the CAM of category n is computed as \begin{equation*}{\operatorname{CAM} _n}(x) = \frac{{\operatorname{ReLU} \left({{A_n}}\right)}}{{\max \left({\operatorname{ReLU} \left({{A_n}}\right)}\right)}},\quad {A_n} = {\mathbf{w}}_n^ \top f(x).\tag{3}\end{equation*}
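As an illustration of Eq. (3), the following PyTorch sketch computes the CAM of a single class from an encoder feature map and the corresponding classifier weight; the tensor shapes and names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feat, w_n, eps=1e-8):
    """feat: (D, h, w) encoder feature map f(x); w_n: (D,) classifier weight of class n."""
    A_n = torch.einsum("d,dhw->hw", w_n, feat)    # A_n = w_n^T f(x)
    A_n = F.relu(A_n)                             # ReLU(A_n)
    return A_n / (A_n.max() + eps)                # Eq. (3): normalize by the maximum
```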
C. Mixture of Gaussian-distributed Prototypes
Inspired by the paradigm of PL, our GMM-BPA strategy is designed to explore features from both the corresponding category support bank and the context support bank. We define the instance prototype as the query, with prototypes from the corresponding category support bank serving as categorical positive keys, and prototypes from the background support bank serving as contextual positive keys.
Modeling Instance Prototypes as Queries
The feature maps f(x) extracted from the backbone network are processed by a projection head H to generate instance prototypes, following the transformation z = H(f(x)). Each instance prototype encapsulates the regional semantics of the input image I that pertain to a specific category. Specifically, each instance prototype is obtained as a normalized vector \begin{equation*}{\mathcal{P}}_n^I = \frac{{\sum\nolimits_{x = 1,y = 1}^{W,H} {{{\mathbf{P}}_n}} (x,y) \cdot z(x,y)}}{{\sum\nolimits_{x = 1,y = 1}^{W,H} {{{\mathbf{P}}_n}} (x,y)}},\tag{4}\end{equation*}
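A minimal sketch of Eq. (4) follows. It assumes that P_n is the per-pixel weight map of class n (e.g., its CAM response) and that z is the projected feature map; these names and shapes are assumptions rather than details fixed by the text.

```python
import torch
import torch.nn.functional as F

def instance_prototype(z, p_n, eps=1e-8):
    """z: (C, H, W) projected features H(f(x)); p_n: (H, W) weight map for class n."""
    proto = (p_n.unsqueeze(0) * z).sum(dim=(1, 2)) / (p_n.sum() + eps)   # Eq. (4)
    return F.normalize(proto, dim=0)                                     # unit-norm prototype
```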
Fig. 1. The overall framework of the GMM-BPA method. Step 1: a standard classification model is trained using the image-level annotations in the dataset. Step 2: local prototypes are generated from all training images for each class and its corresponding context with the GMM bootstrap strategy, and the similarity maps are aggregated into GMM-BPACAM. Step 3: labels for mask refinement methods are generated under the guidance of GMM-BPACAM. Step 4: a segmentation model is trained with the mask-refined semantic pseudo-labels.
Modeling Context Prototypes as Keys Based on GMM
Based on the CAM, every location (i, j) on the feature map is spatially classified into one of two sets, the foreground set F and the background set B, as follows:
\begin{equation*}f{(x)^{i,j}} \in \begin{cases} {{\mathcal{F}},}&{{\text{ if }}\operatorname{CAM} _n^{i,j}(x) \geq \tau } \\ {{\mathcal{B}},}&{{\text{ otherwise }}} \end{cases}\tag{5}\end{equation*}
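The sketch below illustrates one plausible reading of this step: the feature map is split into the sets F and B by thresholding the CAM as in Eq. (5), and a GMM is fit to each set with the EM routine sketched earlier, so that the component means serve as foreground and background prototypes. The helper name fit_gmm_em, the threshold tau, and the component counts G_F and G_B are assumptions.

```python
import torch

def context_prototypes(feat, cam_n, tau, G_F, G_B):
    """feat: (C, H, W) encoder features; cam_n: (H, W) CAM of class n in [0, 1]."""
    C = feat.shape[0]
    flat = feat.permute(1, 2, 0).reshape(-1, C)       # (H*W, C) per-pixel features
    fg_mask = (cam_n >= tau).reshape(-1)              # Eq. (5): foreground set F
    fg_feats = flat[fg_mask].detach().cpu().numpy()
    bg_feats = flat[~fg_mask].detach().cpu().numpy()  # background set B
    # Fit one GMM per set; the component means act as prototypes.
    _, fg_protos, _ = fit_gmm_em(fg_feats, K=G_F)
    _, bg_protos, _ = fit_gmm_em(bg_feats, K=G_B)
    return fg_protos, bg_protos
```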
Selecting Positive Prototypes
Positive prototype selection is critical in our approach, since it strongly affects the quality of the guidance used to generate refined labels. Instance prototypes represent the categorical attributes of the input image, while positive prototypes generalize the category pattern from a comprehensive perspective and summarize strong context features that would otherwise confuse the model prediction. Our selection strategy assigns scores to the candidate prototypes; the score of the i-th foreground candidate is \begin{equation*}{{\mathcal{Z}}_F} = \frac{{\exp \left({{F_i}\cdot{w_n}}\right)}}{{\sum\nolimits_j^{{G_F}} {\exp } \left({{F_j}\cdot{w_n}}\right)}}.\tag{6}\end{equation*}
Similarly, the scores of the positive candidates from the corresponding background are computed as:
\begin{equation*}{{\mathcal{Z}}_B} = \frac{{\exp \left({{B_i}\cdot{w_n}}\right)}}{{\sum\nolimits_j^{{G_B}} {\exp } \left({{B_j}\cdot{w_n}}\right)}}\tag{7}\end{equation*}
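The following sketch illustrates Eqs. (6)-(7) and one possible way to use the scores for selection: each candidate is scored by a softmax over its inner product with the class-n classifier weight, and candidates whose scores are high relative to the maximum (controlled by the thresholds µ_F and µ_B from the implementation details) are kept as positive keys. The relative-thresholding rule is an assumption, not a detail stated in the text.

```python
import torch

def select_positive_prototypes(fg_protos, bg_protos, w_n, mu_F, mu_B):
    """fg_protos: (G_F, C); bg_protos: (G_B, C); w_n: (C,) classifier weight of class n."""
    z_F = torch.softmax(fg_protos @ w_n, dim=0)   # Eq. (6): scores of foreground candidates
    z_B = torch.softmax(bg_protos @ w_n, dim=0)   # Eq. (7): scores of background candidates
    # Assumed selection rule: keep candidates whose score is within a factor
    # mu of the best score in the respective bank.
    pos_fg = fg_protos[z_F >= mu_F * z_F.max()]   # categorical positive keys
    pos_bg = bg_protos[z_B >= mu_B * z_B.max()]   # contextual positive keys
    return pos_fg, pos_bg
```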
Generating BPACAM
Every prototype in the support banks represents a local visual pattern of the category. We therefore compute the average similarity between the feature map and the selected foreground and background prototypes: \begin{equation*}\begin{array}{l} F{G_m} = \frac{1}{{{{\tilde G}_F}}}\sum\limits_{{{\mathbf{F}}_{\mathbf{i}}} \in {\mathbf{\tilde F}}_{{\text{bank }}}^c} {\operatorname{sim} } \left({f(x),{{\mathbf{F}}_{\mathbf{i}}}}\right), \\ B{G_m} = \frac{1}{{{{\tilde G}_B}}}\sum\limits_{{{\mathbf{B}}_{\mathbf{i}}} \in {\mathbf{\tilde B}}_{{\text{bank }}}^c} {\operatorname{sim} } \left({f(x),{{\mathbf{B}}_{\mathbf{i}}}}\right), \end{array}\tag{8}\end{equation*}
\begin{equation*}\begin{array}{l} {S_n} = \frac{1}{M}\sum\limits_{m = 1}^M {\left({F{G_m} - B{G_m}}\right)} , \\ \operatorname{BPACAM}_n(x) = \frac{{\operatorname{ReLU} \left({{S_n}}\right)}}{{\max \left({\operatorname{ReLU} \left({{S_n}}\right)}\right)}} \end{array}\tag{9}\end{equation*}
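A minimal sketch of Eqs. (8)-(9): per-pixel similarities to the selected foreground and background prototype banks are averaged, their difference is accumulated over the M bootstrap rounds, and the result is ReLU-normalized into BPACAM. Using cosine similarity for sim(·,·) and representing the banks as per-round lists are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def bpacam(feat, fg_banks, bg_banks, eps=1e-8):
    """feat: (C, H, W) encoder features; fg_banks / bg_banks: lists of
    (G, C) prototype tensors, one pair per bootstrap round m = 1..M."""
    C, H, W = feat.shape
    flat = F.normalize(feat.reshape(C, -1), dim=0)            # (C, H*W), unit columns
    S_n = torch.zeros(H * W)
    for fg, bg in zip(fg_banks, bg_banks):                    # M bootstrap rounds
        fg_sim = (F.normalize(fg, dim=1) @ flat).mean(dim=0)  # FG_m in Eq. (8)
        bg_sim = (F.normalize(bg, dim=1) @ flat).mean(dim=0)  # BG_m in Eq. (8)
        S_n = S_n + (fg_sim - bg_sim)
    S_n = F.relu(S_n / len(fg_banks))                         # S_n and ReLU in Eq. (9)
    return (S_n / (S_n.max() + eps)).reshape(H, W)            # normalized BPACAM_n
```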
Experiment
A. Setup
Datasets and Evaluation Metric
Experiments were conducted on two benchmarks: PASCAL VOC [12] and MS COCO [13]. The PASCAL VOC [12] dataset comprises 4,369 images across 21 classes, partitioned into 1,464 for training, 1,449 for validation, and 1,456 for testing. Following [15]–[17], we adopt the expanded training set provided by [18], which includes 10,582 images. MS COCO [13] consists of 81 classes with 82,783 training images and 40,454 validation images. Mean Intersection over Union (mIoU) [19] is employed as the evaluation metric on both benchmarks.
Implementation Details
For a fair comparison with other works, we follow [7], [20] and use a ResNet [21] architecture pre-trained on ImageNet [22]. For GMM modeling, we set the bootstrap number to 3 and the numbers of Gaussian components, denoted G_F and G_B, to 8, 12, and 16 for the three sampling iterations, respectively. The threshold for generating CAMs (Eq. (3)) is set to 0.1 for PASCAL VOC [12] and 0.25 for MS COCO [13], while the threshold for generating BPACAM is uniformly set to 0.3 on both datasets. For the selection of positive prototypes, µ_F is set to 0.9 for both datasets, whereas µ_B is set to 0.9 for PASCAL VOC [12] and 0.5 for MS COCO [13], reflecting the increased complexity of the latter dataset. Finally, we employ DeepLabV2 as our segmentation model. GMM prototype generation starts once the classification model has converged.
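For reference, the hyper-parameters listed above can be gathered into a single configuration; the dictionary layout and key names below are illustrative, while the values are the ones reported in this section.

```python
# Values reported above; layout and key names are illustrative.
GMM_BPA_CONFIG = {
    "bootstrap_number": 3,
    "gaussian_components": [8, 12, 16],                       # G_F and G_B per sampling iteration
    "cam_threshold": {"pascal_voc": 0.10, "ms_coco": 0.25},   # threshold for CAMs, Eq. (3)
    "bpacam_threshold": 0.30,                                 # same for both datasets
    "mu_F": 0.90,                                             # both datasets
    "mu_B": {"pascal_voc": 0.90, "ms_coco": 0.50},
    "segmentation_model": "DeepLabV2",
    "backbone": "ResNet pre-trained on ImageNet",
}
```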
B. Results and Analysis
Table I reports the mIoU of GMM-BPA compared with other state-of-the-art works. Our approach CLIP-ES+BPACAM achieves superior results on VOC [12] (74.8% mIoU on the val set and 74.9% on the test set) and on MS COCO [13] (46.8% mIoU). With BPACAM, CLIP-ES improves by 1.6% on the COCO [13] val set. Similar improvements are observed for IRN and AMN: IRN+BPACAM gains around 1.3% on COCO [13] and 7.5% on the VOC [12] test set, while AMN+BPACAM gains 1.9% and 3.6%, respectively.
C. Ablation Study
Our ablation study focuses on the number of Gaussian components and the number of bootstrap sampling iterations, using IRN on the VOC [12] val and test sets. We set G_F and G_B to 8, 12, and 16. With a single bootstrap iteration, performance is not sensitive to this number, remaining at about 72% on the VOC [12] test set; the best performance is achieved when G_F and G_B are set to 12, yielding 71.7% mIoU on the VOC [12] test set. When the bootstrap strategy is applied, an average improvement of 0.5% is observed, indicating that bootstrapping provides a more robust estimation.
Conclusion
In this paper, we introduce the GMM-BPA strategy, designed to harness both instance-specific and contextual knowledge. The essence of our methodology lies in the integration of GMM-based feature extraction with prototype learning. Our approach has been rigorously evaluated on two benchmark datasets, and the results demonstrate that it surpasses current state-of-the-art techniques in terms of performance.