Introduction
Weakly supervised semantic segmentation (WSSS) from image-level labels alone [1] is one of the most challenging variants of WSSS, and it is the setting our work addresses.
The standard procedure consists of three stages: a classification model is first trained to produce Class Activation Map (CAM) seeds; these CAMs are then refined to generate pseudo-labels; finally, a segmentation model is trained on the pseudo-labels in a fully supervised fashion. Our research focuses on refining the CAM generation process so that the maps cover entire objects more completely, corresponding to the second stage of this pipeline. Recent work, including [2]–[4], seeks to improve the segmentation model's accuracy by incorporating contextual knowledge. Furthermore, some studies, motivated by advances in representation learning
[5], [6], advocate combining semantic context with instance-specific knowledge derived from global-scale modeling in order to refine instance-level semantic features, as demonstrated in [7] and [8]. However, these studies do not sufficiently address the large intra-class variation present in real-world scenarios. The traditional CAM and its derivatives often fail to cover objects completely, because they are produced by discriminative models that inherently disregard regions with high class co-occurrence [8], [9]; this issue is particularly pronounced when training with a limited number of classes. Prototype learning (PL) [10], [11] posits that prototypes can encapsulate different aspects of object representation, from local to global features or even object attributes, which holds promise for improving the discrimination and generalization ability of WSSS models.
Inspired by PL, we propose a novel GMM-BPA strategy that not only alleviates the knowledge bias arising from the GMM-modeled feature attributes of a specific category but also mines spurious features from confusing contexts so that they can be suppressed. Specifically, we hypothesize that the features extracted from the penultimate layer of a classification model follow a Gaussian Mixture Model (GMM). Prototypes are then derived from these GMM-distributed features for each category and its corresponding background using the EM algorithm. To obtain a comprehensive perspective, we incorporate categorical support banks, which capture intra-class feature diversity beyond the constraints imposed by mini-batch training. Finally, we introduce a bootstrap prototype-aware CAM that serves a dual purpose: it captures elusive local features of objects while suppressing indiscriminate features from confusing classes. Recognizing that a limited set of instance features can bias the feature distribution, we further adopt a weighted feature distribution alignment that encourages features to migrate towards a globally aggregated categorical memory bank.
Finally, we conduct extensive experimental validation on the benchmark datasets PASCAL VOC [12] and MS COCO [13]. The results not only substantiate the effectiveness of our approach but also show that it achieves state-of-the-art performance in weakly supervised semantic segmentation. Overall, the contributions of this work include the following aspects:
We propose a Gaussian Mixture-based prototype learning strategy that alleviates the knowledge bias between instances and their context.
We propose a bootstrap prototype-aware CAM that generates better pseudo-labels.
We achieve a significant improvement and state-of-the-art performance on two benchmarks.
Methodology
A. Preliminaries
We first briefly revisit the GMM, which is defined as a weighted sum of K Gaussian distributions, each with its own mean µk and covariance matrix Σk: \begin{equation*} p(x) = \sum\limits_{k = 1}^{K} {\pi _k}\,\mathcal{N}\left({x\mid{\mu _k},{\Sigma _k}}\right), \end{equation*} where the mixing coefficients πk sum to one. The parameters are estimated with the Expectation-Maximization (EM) algorithm as follows.
Step 1: Initialization. Initialize the means µk, covariance matrices Σk, and mixing coefficients πk.
Step 2: Expectation (E-step). Compute the responsibility γ(zik) of each data point xi for each component k as: \begin{equation*} \gamma \left({{z_{ik}}}\right) = \frac{{{\pi _k}\,\mathcal{N}\left({{x_i}\mid{\mu _k},{\Sigma _k}}\right)}}{{\sum\nolimits_{j = 1}^K {{\pi _j}\,\mathcal{N}\left({{x_i}\mid{\mu _j},{\Sigma _j}}\right)} }}. \end{equation*}
Step 3: Maximization (M-step). The M-step updates the parameters of the GMM to maximize the expected log-likelihood from the E-step. The mixing coefficient πk, mean µk, and covariance matrix Σk are updated as \begin{align*} & {\pi _k} = \frac{1}{N}\sum\limits_{i = 1}^N \gamma \left({{z_{ik}}}\right),\quad {\mu _k} = \frac{{\sum\nolimits_{i = 1}^N \gamma \left({{z_{ik}}}\right){x_i}}}{{\sum\nolimits_{i = 1}^N \gamma \left({{z_{ik}}}\right)}},\tag{1} \\ & {\Sigma _k} = \frac{{\sum\nolimits_{i = 1}^N \gamma \left({{z_{ik}}}\right)\left({{x_i} - {\mu _k}}\right){{\left({{x_i} - {\mu _k}}\right)}^T}}}{{\sum\nolimits_{i = 1}^N \gamma \left({{z_{ik}}}\right)}}.\tag{2}\end{align*}
Step 4: Iteration. Repeat the E-step and M-step until convergence, which is typically measured by the change in the log-likelihood.
Step 5: Model assessment. Once the algorithm converges, the final parameters characterize the GMM.
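To make the procedure above concrete, the following is a minimal NumPy sketch of EM for a GMM; the data matrix X, the component count K, and the convergence tolerance are illustrative placeholders rather than settings from this paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iters=100, tol=1e-4, seed=0):
    """Fit a K-component GMM to the rows of X (N x D) with EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initialize means, covariances, and mixing coefficients.
    mu = X[rng.choice(N, K, replace=False)]
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # Step 2 (E-step): responsibilities gamma(z_ik).
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], cov[k])
                         for k in range(K)], axis=1)          # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # Step 3 (M-step): update pi_k, mu_k, Sigma_k (Eqs. (1)-(2)).
        Nk = gamma.sum(axis=0)                                 # (K,)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        # Step 4: convergence check via the change in log-likelihood.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    # Step 5: the converged parameters characterize the GMM.
    return pi, mu, cov
```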
B. Framework Overview
In this work, we introduce a GMM-based bootstrap prototype-aware strategy tailored for WSSS with image-level annotations. Our framework follows the conventional WSSS pipeline and involves several key steps. First, a classification model is trained to recognize each object's category. Next, the features extracted by the backbone network are modeled with a GMM to generate prototypes for the category-specific and background support banks. We then construct a GMM bootstrap prototype-aware Class Activation Map (CAM), which guides the generation of labels for a mask refinement network. Finally, the pseudo-labels produced by mask refinement are used to train a segmentation model in a fully supervised manner. This systematic strategy integrates weak supervision for effective semantic segmentation.
We denote a set of input images x ∈ R^{3×H×W} and the corresponding multi-hot classification labels y ∈ {0, 1}^N, where N is the total number of categories. The training dataset is formalized as D = {(x, y)}.
We denote the output feature map of the trained encoder as f(x). Given the classifier weight w_n for category n, the CAM of category n is computed as \begin{equation*}{\operatorname{CAM} _n}(x) = \frac{{\operatorname{ReLU} \left({{A_n}}\right)}}{{\max \left({\operatorname{ReLU} \left({{A_n}}\right)}\right)}},\quad {A_n} = {\mathbf{w}}_n^ \top f(x).\tag{3}\end{equation*}
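As an illustration of Eq. (3), the following PyTorch sketch computes the CAM of a single class from an encoder feature map and the corresponding classifier weight; the tensor shapes and names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feat, w_n, eps=1e-8):
    """feat: (D, h, w) encoder feature map f(x); w_n: (D,) classifier weight of class n."""
    A_n = torch.einsum("d,dhw->hw", w_n, feat)    # A_n = w_n^T f(x)
    A_n = F.relu(A_n)                             # ReLU(A_n)
    return A_n / (A_n.max() + eps)                # Eq. (3): normalize by the maximum
```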
C. Mixture of Gaussian-distributed Prototypes
Inspired by the paradigm of PL, our GMM-BPA strategy is designed to explore features from both the corresponding category support bank and the context support bank. We define the instance prototype as the query, with prototypes from the corresponding category support bank serving as categorical positive keys, and prototypes from the background support bank serving as contextual positive keys.
Modeling Instance Prototypes as Queries
The feature maps f(x) extracted from the backbone network are processed by a projection head H to generate instance prototypes, following the transformation z = H(f(x)). Each instance prototype encapsulates the regional semantics of the input image I that pertain to a specific category. Specifically, each instance prototype is obtained as a normalized vector \begin{equation*}{\mathcal{P}}_n^I = \frac{{\sum\nolimits_{x = 1,y = 1}^{W,H} {{{\mathbf{P}}_n}} (x,y) \cdot z(x,y)}}{{\sum\nolimits_{x = 1,y = 1}^{W,H} {{{\mathbf{P}}_n}} (x,y)}},\tag{4}\end{equation*}
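A minimal sketch of Eq. (4) follows. It assumes that P_n is the per-pixel weight map of class n (e.g., its CAM response) and that z is the projected feature map; these names and shapes are assumptions rather than details fixed by the text.

```python
import torch
import torch.nn.functional as F

def instance_prototype(z, p_n, eps=1e-8):
    """z: (C, H, W) projected features H(f(x)); p_n: (H, W) weight map for class n."""
    proto = (p_n.unsqueeze(0) * z).sum(dim=(1, 2)) / (p_n.sum() + eps)   # Eq. (4)
    return F.normalize(proto, dim=0)                                     # unit-norm prototype
```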
Fig. 1. The overall framework of the GMM-BPA method. Step 1: a standard classification model is trained using the image-level annotations in the dataset. Step 2: local prototypes are generated from all training images for each class and its corresponding context with the GMM bootstrap strategy, and the similarity maps are aggregated into GMM-BPACAM. Step 3: labels for mask refinement methods are generated under the guidance of GMM-BPACAM. Step 4: a segmentation model is trained with the mask-refined semantic pseudo-labels.
Modeling Context Prototypes as Keys Based on GMM
Based on the CAM, every location (i, j) on the feature map is spatially classified into one of two sets, the foreground set F and the background set B, as follows:
\begin{equation*}f{(x)^{i,j}} \in \begin{cases} {{\mathcal{F}},}&{{\text{ if }}\operatorname{CAM} _n^{i,j}(x) \geq \tau } \\ {{\mathcal{B}},}&{{\text{ otherwise }}} \end{cases}\tag{5}\end{equation*}
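The sketch below illustrates one plausible reading of this step: the feature map is split into the sets F and B by thresholding the CAM as in Eq. (5), and a GMM is fit to each set with the EM routine sketched earlier, so that the component means serve as foreground and background prototypes. The helper name fit_gmm_em, the threshold tau, and the component counts G_F and G_B are assumptions.

```python
import torch

def context_prototypes(feat, cam_n, tau, G_F, G_B):
    """feat: (C, H, W) encoder features; cam_n: (H, W) CAM of class n in [0, 1]."""
    C = feat.shape[0]
    flat = feat.permute(1, 2, 0).reshape(-1, C)       # (H*W, C) per-pixel features
    fg_mask = (cam_n >= tau).reshape(-1)              # Eq. (5): foreground set F
    fg_feats = flat[fg_mask].detach().cpu().numpy()
    bg_feats = flat[~fg_mask].detach().cpu().numpy()  # background set B
    # Fit one GMM per set; the component means act as prototypes.
    _, fg_protos, _ = fit_gmm_em(fg_feats, K=G_F)
    _, bg_protos, _ = fit_gmm_em(bg_feats, K=G_B)
    return fg_protos, bg_protos
```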
Selecting Positive Prototypes
Positive prototype selection is critical in our approach, since it strongly affects the quality of the guidance used to generate refined labels. Instance prototypes represent the categorical attributes of the input image, while positive prototypes generalize the category pattern from a comprehensive perspective and summarize strong context features that would otherwise confuse the model prediction. Our selection strategy assigns scores to the candidate prototypes; the score of the i-th foreground candidate is \begin{equation*}{{\mathcal{Z}}_F} = \frac{{\exp \left({{F_i}\cdot{w_n}}\right)}}{{\sum\nolimits_j^{{G_F}} {\exp } \left({{F_j}\cdot{w_n}}\right)}}.\tag{6}\end{equation*}
Similarly, the scores of the positive candidates from the corresponding background are computed as:
\begin{equation*}{{\mathcal{Z}}_B} = \frac{{\exp \left({{B_i}\cdot{w_n}}\right)}}{{\sum\nolimits_j^{{G_B}} {\exp } \left({{B_j}\cdot{w_n}}\right)}}\tag{7}\end{equation*}
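The following sketch illustrates Eqs. (6)-(7) and one possible way to use the scores for selection: each candidate is scored by a softmax over its inner product with the class-n classifier weight, and candidates whose scores are high relative to the maximum (controlled by the thresholds µ_F and µ_B from the implementation details) are kept as positive keys. The relative-thresholding rule is an assumption, not a detail stated in the text.

```python
import torch

def select_positive_prototypes(fg_protos, bg_protos, w_n, mu_F, mu_B):
    """fg_protos: (G_F, C); bg_protos: (G_B, C); w_n: (C,) classifier weight of class n."""
    z_F = torch.softmax(fg_protos @ w_n, dim=0)   # Eq. (6): scores of foreground candidates
    z_B = torch.softmax(bg_protos @ w_n, dim=0)   # Eq. (7): scores of background candidates
    # Assumed selection rule: keep candidates whose score is within a factor
    # mu of the best score in the respective bank.
    pos_fg = fg_protos[z_F >= mu_F * z_F.max()]   # categorical positive keys
    pos_bg = bg_protos[z_B >= mu_B * z_B.max()]   # contextual positive keys
    return pos_fg, pos_bg
```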
Generating BPACAM
Every prototype in the support banks represents a local visual pattern of the category. We therefore compute the average similarity between the feature map and the selected foreground and background prototypes: \begin{equation*}\begin{array}{l} F{G_m} = \frac{1}{{{{\tilde G}_F}}}\sum\limits_{{{\mathbf{F}}_{\mathbf{i}}} \in {\mathbf{\tilde F}}_{{\text{bank }}}^c} {\operatorname{sim} } \left({f(x),{{\mathbf{F}}_{\mathbf{i}}}}\right), \\ B{G_m} = \frac{1}{{{{\tilde G}_B}}}\sum\limits_{{{\mathbf{B}}_{\mathbf{i}}} \in {\mathbf{\tilde B}}_{{\text{bank }}}^c} {\operatorname{sim} } \left({f(x),{{\mathbf{B}}_{\mathbf{i}}}}\right), \end{array}\tag{8}\end{equation*}
\begin{equation*}\begin{array}{l} {S_n} = \frac{1}{M}\sum\limits_{m = 1}^M {\left({F{G_m} - B{G_m}}\right)} , \\ \operatorname{BPACAM}_n(x) = \frac{{\operatorname{ReLU} \left({{S_n}}\right)}}{{\max \left({\operatorname{ReLU} \left({{S_n}}\right)}\right)}} \end{array}\tag{9}\end{equation*}
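A minimal sketch of Eqs. (8)-(9): per-pixel similarities to the selected foreground and background prototype banks are averaged, their difference is accumulated over the M bootstrap rounds, and the result is ReLU-normalized into BPACAM. Using cosine similarity for sim(·,·) and representing the banks as per-round lists are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def bpacam(feat, fg_banks, bg_banks, eps=1e-8):
    """feat: (C, H, W) encoder features; fg_banks / bg_banks: lists of
    (G, C) prototype tensors, one pair per bootstrap round m = 1..M."""
    C, H, W = feat.shape
    flat = F.normalize(feat.reshape(C, -1), dim=0)            # (C, H*W), unit columns
    S_n = torch.zeros(H * W)
    for fg, bg in zip(fg_banks, bg_banks):                    # M bootstrap rounds
        fg_sim = (F.normalize(fg, dim=1) @ flat).mean(dim=0)  # FG_m in Eq. (8)
        bg_sim = (F.normalize(bg, dim=1) @ flat).mean(dim=0)  # BG_m in Eq. (8)
        S_n = S_n + (fg_sim - bg_sim)
    S_n = F.relu(S_n / len(fg_banks))                         # S_n and ReLU in Eq. (9)
    return (S_n / (S_n.max() + eps)).reshape(H, W)            # normalized BPACAM_n
```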
Experiment
A. Setup
Datasets and Evaluation Metric
Experiments were conducted on two benchmarks: PASCAL VOC [12] and MS COCO [13]. The PASCAL VOC [12] dataset comprises 4,369 images across 21 classes, partitioned into 1,464 for training, 1,449 for validation, and 1,456 for testing. Following [15]–[17], we adopt the expanded training set provided by [18], which includes 10,582 images. MS COCO [13] consists of 81 classes with 82,783 training images and 40,454 validation images. Mean Intersection over Union (mIoU) [19] is employed as the evaluation metric on both benchmarks.
Implementation Details
For a fair comparison with other works, we follow [7], [20] and use a ResNet [21] architecture pre-trained on ImageNet [22]. For GMM modeling, we set the bootstrap number to 3 and the numbers of Gaussian components, denoted G_F and G_B, to 8, 12, and 16 for the three sampling iterations, respectively. The threshold for generating CAMs (Eq. (3)) is set to 0.1 for PASCAL VOC [12] and 0.25 for MS COCO [13], while the threshold for generating BPACAM is uniformly set to 0.3 on both datasets. For the selection of positive prototypes, µ_F is set to 0.9 for both datasets, whereas µ_B is set to 0.9 for PASCAL VOC [12] and 0.5 for MS COCO [13], reflecting the increased complexity of the latter dataset. Finally, we employ DeepLabV2 as our segmentation model. GMM prototype generation starts once the classification model has converged.
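For reference, the hyper-parameters listed above can be gathered into a single configuration; the dictionary layout and key names below are illustrative, while the values are the ones reported in this section.

```python
# Values reported above; layout and key names are illustrative.
GMM_BPA_CONFIG = {
    "bootstrap_number": 3,
    "gaussian_components": [8, 12, 16],                       # G_F and G_B per sampling iteration
    "cam_threshold": {"pascal_voc": 0.10, "ms_coco": 0.25},   # threshold for CAMs, Eq. (3)
    "bpacam_threshold": 0.30,                                 # same for both datasets
    "mu_F": 0.90,                                             # both datasets
    "mu_B": {"pascal_voc": 0.90, "ms_coco": 0.50},
    "segmentation_model": "DeepLabV2",
    "backbone": "ResNet pre-trained on ImageNet",
}
```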
B. Results and Analysis
Table I reports the mIoU of GMM-BPA compared with other state-of-the-art works. Our approach CLIP-ES+BPACAM achieves superior results on VOC [12] (74.8% mIoU on the val set and 74.9% on the test set) and on MS COCO [13] (46.8% mIoU). With BPACAM, CLIP-ES improves by 1.6% on the COCO [13] val set. Similar improvements are observed for IRN and AMN: IRN+BPACAM gains around 1.3% on COCO [13] and 7.5% on the VOC [12] test set, while AMN+BPACAM gains 1.9% and 3.6%, respectively.
C. Ablation Study
Our ablation study focuses on the number of Gaussian components and the number of bootstrap sampling iterations, using IRN on the VOC [12] val and test sets. We set G_F and G_B to 8, 12, and 16. With a single bootstrap iteration, performance is not sensitive to this number, remaining at about 72% on the VOC [12] test set; the best performance is achieved when G_F and G_B are set to 12, yielding 71.7% mIoU on the VOC [12] test set. When the bootstrap strategy is applied, an average improvement of 0.5% is observed, indicating that bootstrapping provides a more robust estimation.
Conclusion
In this paper, we introduce the GMM-BPA strategy, designed to harness both instance-specific and contextual knowledge. The essence of our methodology lies in the integration of GMM-based feature extraction with prototype learning. Our approach has been rigorously evaluated on two benchmark datasets, and the results demonstrate that it surpasses current state-of-the-art techniques in terms of performance.