Introduction
Superpixel segmentation algorithms play a crucial role in transforming the original image into visually significant regions. In this process, they not only group similar pixels at a lower level but also ensure that the resulting superpixels retain independent semantic information. Superpixel segmentation algorithms improve the efficiency of image representation and reduce data dimensionality, making them indispensable preprocessing steps in various computer vision tasks, such as semantic segmentation [1], [2], [3], object detection [4], [5], optical flow estimation [6], [7], and more.
Superpixel segmentation has achieved significant progress over the past two decades. Traditional superpixel segmentation methods are broadly categorized into gradient-based and graph-based approaches. Gradient-based approaches, such as SLIC, offer computational efficiency but struggle with boundary adherence. Other clustering-based methods [8], [9] have improved boundary adherence by employing different distance measurement strategies. On the other hand, graph-based segmentation approaches transform the segmentation problem into energy function minimization, treating image pixels as graph nodes and assigning weights to the edges between nodes. These algorithms generally preserve image boundary information well but result in less compact superpixels [10], [11].
Recently proposed deep learning-based superpixel segmentation methods, such as SSN [12], SCN [13], AINet [14], and BINet [15], have evolved from the SLIC clustering algorithm. These methods still incorporate two main steps: generating initial superpixel seeds through uniform grid initialization, and iteratively learning pixel affiliations within a fixed neighborhood, ultimately assigning class labels to pixels. SSN first replaced the nearest neighbor algorithm with a soft association between pixels and grids, making the algorithm differentiable and suitable for network backpropagation. SCN then used a fully convolutional network to directly predict association scores between image pixels and regular grid cells without the need for iterative clustering, thereby improving efficiency. Subsequent works have extended the end-to-end encoder-decoder approach of SCN. For instance, AINet directly associates high-level grid features with corresponding pixels, providing more effective contextual information. BINet incorporates neural structures and visual mechanisms into the superpixel segmentation network, innovatively mapping the superpixel segmentation network onto the ventral information pathway of the human visual system.
These deep learning-based superpixel segmentation methods face challenges when segmenting objects with varying rates of pixel spatial changes, particularly within the same image. These difficulties stem from the reliance on uniformly initializing superpixels on a regular grid, which involves computing the relationships between each pixel and its fixed neighborhood. Such a seed initialization strategy often leads to the generation of multiple superpixels with similar characteristics in flat regions, resulting in unnecessary segmentation. Furthermore, during the iterative update of superpixel centers, the model might overlook small object features due to the predominance of pixels in flat regions. Thus, it is crucial to adaptively initialize superpixel seeds based on the content of the image.
To address these challenges, we draw inspiration from the saliency-based model of visual attention. Our model generates superpixels with improved boundary adherence in non-flat regions by identifying image regions with different spatial variation rates, while allocating less attention to flat regions, resulting in more compact superpixels there. Specifically, inspired by the visual attention mechanism, we propose a seed initialization strategy based on the geodesic distance transform. This strategy perceives image content by calculating the similarity between pixels within the image. We binarize the geodesic distance-transformed image to segment each image into flat and non-flat regions. We then develop a network with two output heads to train pixel-superpixel associations at two scales, and merge the pixel attribute reconstruction results at the two scales into the loss function. This allows the network to perform joint training under seed initialization strategies at two scales and to output the corresponding association maps.
We summarize the contributions of this work: (1) A novel seed initialization strategy that adapts to image content, utilizing the geodesic distance transform, is proposed. (2) The proposed CDNet includes two segmentation heads, with the innovative integration of scale into the loss function, enabling the network to perform collaborative training on seed initialization grids at two different scales. (3) Inspired by biological vision mechanisms, CDNet is, to our knowledge, the first deep learning-based superpixel segmentation network capable of adaptively segmenting superpixels of different sizes based on image content.
The Proposed Method
A. Seed Initialization Strategy
Drawing inspiration from the visual attention mechanism in human vision, we propose a seed initialization strategy based on geodesic distance transform. The geodesic distance transform can be used to calculate the similarity between two pixels. As shown in Figure 1, despite the Euclidean distances between A and B and between B and C being identical, the geodesic distance between A and B is greater than that between B and C. Unlike Euclidean distance, geodesic distance takes into account obstacles and topological structures in space. Therefore, we use the geodesic distance transform to obtain texture structure information of the image.
Fig. 1. Comparison of geodesic distances of different pixels; dg denotes the geodesic distance and de the Euclidean distance.
Formally, given an image I ∈ ℝ^{H×W×3} defined on a 2D domain Ψ, the geodesic distance [17] between two pixels P_a, P_b ∈ Ψ is defined as:
\begin{equation*}D\left( {{P_a},{P_b}} \right) = \mathop {\min }\limits_{\Gamma \in {\mathcal{P}_{{P_a},{P_b}}}} \int_0^1 {\sqrt {{{\left\| {{\Gamma ^\prime }(s)} \right\|}^2} + {{\left( {\nabla I \cdot \frac{{{\Gamma ^\prime }(s)}}{{\left\| {{\Gamma ^\prime }(s)} \right\|}}} \right)}^2}} } ds\tag{1}\end{equation*}
where Γ denotes a path drawn from 𝒫_{P_a,P_b}, the set of all feasible paths connecting P_a and P_b, parameterized by s ∈ [0, 1]; Γ′(s) is its derivative and ∇I is the image gradient.
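For illustration, the geodesic distance in Eq. (1) can be approximated with a chamfer-style two-pass raster scan. The following Python sketch is a minimal, unoptimized approximation under stated assumptions: the choice of seed mask (the text does not specify what the transform is seeded with) and the unit weighting of the gradient term are placeholders, not the paper's settings.

import numpy as np

def geodesic_distance_map(img_gray, seeds, n_iters=2):
    # Chamfer-style approximation of Eq. (1): each step pays its spatial
    # length plus the intensity change along it (the gradient term).
    # img_gray: (H, W) float array; seeds: (H, W) boolean seed mask (assumed).
    H, W = img_gray.shape
    dist = np.where(seeds, 0.0, np.inf)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]  # causal neighbors
    for _ in range(n_iters):
        for sweep in (1, -1):  # forward, then backward raster scan
            ys = range(H) if sweep == 1 else range(H - 1, -1, -1)
            xs = range(W) if sweep == 1 else range(W - 1, -1, -1)
            for y in ys:
                for x in xs:
                    for dy, dx in offsets:
                        ny, nx = y + sweep * dy, x + sweep * dx
                        if 0 <= ny < H and 0 <= nx < W:
                            step = np.hypot(dy, dx) + abs(
                                img_gray[y, x] - img_gray[ny, nx])
                            if dist[ny, nx] + step < dist[y, x]:
                                dist[y, x] = dist[ny, nx] + step
    return dist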
Then, we use an average threshold to binarize the geodesic distance-transformed image, dividing each image into flat and non-flat regions. This allows for a divide-and-conquer approach to customize superpixel initialization for different image regions. Moreover, to obtain more reliable edges, we filter out some contours through curvature. Subsequently, we discard contours with small areas, as larger superpixel initialization is required for flat regions. Finally, pixels within the retained contours are assigned a value of 1, indicating flat areas with low pixel spatial variation, while the remaining areas represent non-flat regions. The pseudocode for this process is described in Algorithm 1.
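Since Algorithm 1 is not reproduced here, the following Python sketch illustrates the region-splitting step it describes. Several details are assumptions rather than the paper's settings: the threshold direction (flat regions taken as low geodesic distance), the curvature proxy (a perimeter-to-area ratio), and the min_area and max_curvature values.

import cv2
import numpy as np

def flat_region_mask(geo_dist, min_area=500, max_curvature=0.2):
    # Average-threshold binarization; the assumption here is that flat
    # regions correspond to small geodesic distances.
    binary = (geo_dist < geo_dist.mean()).astype(np.uint8)

    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(binary)
    for c in contours:
        area = cv2.contourArea(c)
        if area < min_area:   # discard small contours: flat regions
            continue          # need large superpixel initialization
        perimeter = cv2.arcLength(c, closed=True)
        # Rough curvature proxy: perimeter relative to enclosed area;
        # high values indicate jagged, unreliable edges.
        if perimeter / max(area, 1e-6) > max_curvature:
            continue
        cv2.drawContours(mask, [c], -1, color=1, thickness=-1)
    return mask  # 1 = flat region, 0 = non-flat region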
B. Joint Segmentation Network
In recent years, deep learning-based superpixel segmentation methods, particularly those following the SCN approach, have utilized encoder-decoder networks to predict association maps Q in an end-to-end manner. It is noteworthy that the network structures of these methods cannot accommodate strategies for initializing grids of different sizes.
To address the challenges, we introduce two output head layers after the decoder module for the joint training of association maps at different scales. As shown in Figure 3, the image undergoes geodesic distance transformation and thresholding to derive boundary information, producing a binary mask. Pixels with a mask value of 1 indicate non-flat regions, whereas those with a value of 0 indicate flat regions. Therefore, this mask can be used to guide the network in initializing superpixels with appropriate grid sizes at the corresponding locations. Finally, the feature maps from the decoder module traverse two parallel convolutional layers and output the association maps Q1 and Q2 ∈ ℛH×W×9. The association maps Q1 and Q2 represent the probability of each pixel being associated with its surrounding large-scale grids and small-scale grids, respectively.
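A minimal PyTorch sketch of the two output heads is given below. The decoder channel width in_channels is an assumption; each head predicts a 9-channel association map normalized with a softmax, following the SCN-style formulation.

import torch
import torch.nn as nn

class DualAssociationHead(nn.Module):
    # Two parallel conv heads over the shared decoder features, each
    # predicting a 9-way pixel-to-grid association map at one grid scale.
    def __init__(self, in_channels=16):  # channel width is an assumption
        super().__init__()
        self.head_large = nn.Conv2d(in_channels, 9, kernel_size=3, padding=1)
        self.head_small = nn.Conv2d(in_channels, 9, kernel_size=3, padding=1)

    def forward(self, feats):
        q1 = torch.softmax(self.head_large(feats), dim=1)  # large-scale grid
        q2 = torch.softmax(self.head_small(feats), dim=1)  # small-scale grid
        return q1, q2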
Fig. 2. Our seed initialization strategy vs. other techniques. Left column: others; right column: ours. We initialize two grids of different scales and merge them based on the boundary obtained by the geodesic distance transform.
Due to the absence of superpixel labels, the network is unable to perform supervised learning to train the association map Q. Therefore, Q is utilized as an intermediate variable to reconstruct pixel attributes, such as semantic labels and position vectors, and the loss is calculated between the reconstructed and actual pixel labels. The training phase can be summarized into two essential steps. The first step is to assign pixel-to-superpixel relationships using the association map Q and compute the superpixel center attributes. We take into account the initialization grid size in this step, as follows:
\begin{equation*}{\mathbf{l}}({\mathbf{s}},d) = \frac{{\sum\nolimits_{{\mathbf{p}}:{\mathbf{s}} \in \mathcal{N}_{\mathbf{p}}^d} {\operatorname{avgpool} } \left( {{\mathbf{f}}({\mathbf{p}}) \odot {q_d}({\mathbf{p}},{\mathbf{s}})} \right)}}{{\sum\nolimits_{{\mathbf{p}}:{\mathbf{s}} \in \mathcal{N}_{\mathbf{p}}^d} {{q_d}} ({\mathbf{p}},{\mathbf{s}})}}\tag{2}\end{equation*}
Here, d denotes the initialization grid size, with d = 16 and d = 32 used in our implementation, and s denotes a superpixel.
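A simplified PyTorch sketch of Eq. (2) follows. For brevity it collapses the nine neighbor channels into a single per-pixel weight before pooling, whereas the full method accumulates each pixel's contribution to each of its nine surrounding cells separately; treat it as an illustration of the weighted d × d pooling rather than a faithful reimplementation.

import torch.nn.functional as F

def superpixel_centers(f, q, d):
    # f: (B, C, H, W) pixel attributes; q: (B, 9, H, W) association map.
    # Collapse the 9 neighbor channels into one per-pixel weight (a
    # simplification), then pool the weighted attributes over d x d cells.
    w = q.sum(dim=1, keepdim=True)
    num = F.avg_pool2d(f * w, kernel_size=d)
    den = F.avg_pool2d(w, kernel_size=d)
    return num / (den + 1e-8)  # (B, C, H/d, W/d) center attributes l(s, d)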
Fig. 3. The proposed CDNet adopts a geodesic distance-based seed initialization strategy to learn two-scale pixel-to-superpixel association maps Q and merges them to obtain the final segmentation result.
In the second step, pixel attributes are reconstructed using the superpixel center attributes and the association map, as follows:
\begin{equation*}{{\mathbf{f}}^\prime }({\mathbf{p}},d) = \sum\limits_{{\mathbf{s}} \in \mathcal{N}_{\mathbf{p}}^d} {{{\operatorname{inter} }_{d,{\text{nn}}}}} ({\mathbf{l}}({\mathbf{s}},d)) \cdot {q_d}({\mathbf{p}},{\mathbf{s}})\tag{3}\end{equation*}
Here, inter_{d,nn} denotes nearest-neighbor interpolation with a step size equal to d, and f′(p, d) denotes the pixel attributes reconstructed from the association map.
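Correspondingly, Eq. (3) can be sketched as a nearest-neighbor upsampling of the center attributes followed by association weighting; as with the sketch of Eq. (2), the nine neighbor channels are collapsed here for brevity.

import torch.nn.functional as F

def reconstruct_pixels(centers, q, d):
    # inter_{d,nn}: nearest-neighbor upsampling with step d, then weighting
    # by the (collapsed) associations, mirroring Eq. (3).
    up = F.interpolate(centers, scale_factor=d, mode="nearest")
    return up * q.sum(dim=1, keepdim=True)  # f'(p, d)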
To group pixels with similar attributes, we use cross-entropy loss, and to enhance spatial compactness within superpixels, we apply the ℓ2 norm. The overall network training loss comprises two components: cross-entropy (CE) loss for the semantic label and L2 reconstruction loss for the position vector, where p′ denotes the reconstructed position of pixel p and the weight m/d scales the compactness term with the grid size. The overall loss function ℒ is formulated as follows:
\begin{equation*}\mathcal{L} = \sum\limits_p C E\left( {{{\mathbf{f}}^\prime }({\mathbf{p}},d),{\mathbf{f}}({\mathbf{p}})} \right) + \frac{m}{d}{\left\| {{\mathbf{p}} - {{\mathbf{p}}^\prime }} \right\|_2},\tag{4}\end{equation*}
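A sketch of Eq. (4) in PyTorch is shown below; f is assumed to be a one-hot semantic label map, pos and pos_rec the true and reconstructed position vectors, and the value of the compactness weight m is a placeholder, not the paper's setting.

import torch

def cdnet_loss(f_rec, f, pos_rec, pos, d, m=0.003):
    # Soft cross-entropy between reconstructed and one-hot semantic labels.
    ce = -(f * torch.log(f_rec.clamp(min=1e-8))).sum(dim=1).mean()
    # Scale-aware compactness: L2 distance between true and reconstructed
    # positions, weighted by m/d so larger grids tolerate larger deviations.
    l2 = (m / d) * (pos - pos_rec).norm(p=2, dim=1).mean()
    return ce + l2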
Experiment
A. Datasets and Implementation Details
To evaluate the effectiveness of our approach, experiments were performed on the BSDS500 [18] and NYUv2 [19] datasets. BSDS500 includes 200 training, 100 validation, and 200 test images, yielding 1,087 training, 546 validation, and 1,063 test samples. The NYUv2 dataset, aimed at indoor scene understanding, comprises 1,449 annotated images. Following previous works, unlabeled boundary regions are excluded, and a subset of 400 test images (608×448 pixels) is selected for superpixel evaluation.
For the training phase, input images from the BSDS500 dataset are randomly cropped to 224×224 pixels. The initial learning rate is set to 8e-5, and it is halved after 8,000 iterations. Using the Adam optimizer, the model is trained for 200 epochs with a batch size of 8. All experiments are implemented within the PyTorch framework on a workstation equipped with an Intel Core i7 CPU and an RTX 2080 GPU.
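For reference, the optimization setup above corresponds to the following PyTorch configuration, reading "halved after 8,000 iterations" as a step decay every 8,000 iterations; model stands in for the CDNet instance.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=8e-5)
# Halve the learning rate every 8,000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8000,
                                            gamma=0.5)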
B. Comparison with the State-Of-The-Arts
We assess the performance of superpixel methods using three metrics: achievable segmentation accuracy (ASA), boundary recall and precision (BR-BP), and compactness (CO). The ASA score measures the upper limit of segmentation accuracy achievable with superpixels as a pre-processing step, whereas BR and BP evaluate the superpixel model’s ability to identify semantic boundaries effectively. CO assesses the compactness of the superpixels. Therefore, the BR-BP and CO metrics can effectively evaluate the adherence of superpixels in non-flat regions and the compactness of superpixels in flat regions. A comprehensive performance comparison on BSDS500 and NYUv2 datasets is illustrated in Figures 4-5.
(1) Results on BSDS500
As shown in Figure 4, we compare CDNet with other superpixel segmentation methods on the BSDS500 dataset using metrics such as ASA, BR-BP, and CO. The results demonstrate that CDNet outperforms other methods, particularly when the number of superpixels is large. This is likely because methods based on uniform superpixel initialization strategies tend to produce unnecessary segmentation in flat regions when the superpixel count is high. Although CDNet's performance in segmentation accuracy is slightly inferior when the number of superpixels is low, it still remains comparable to other deep learning-based methods such as AINet. Overall, the proposed CDNet, which employs a divide-and-conquer approach based on image content, is capable of generating appropriate superpixel segmentation results for both flat and non-flat regions.
(2) Results on NYUv2
Figure 5 depicts the comparison results of CDNet on the NYUv2 dataset without any fine-tuning. Similarly, the proposed CDNet outperforms other superpixel segmentation algorithms in all three metrics. The performance on NYUv2 shows no significant correlation with the number of superpixels. We attribute this to the NYUv2 dataset being primarily used for indoor scene understanding, where the presence of more diverse objects results in larger non-flat regions.
Fig. 6. Qualitative results of four state-of-the-art superpixel methods: SCN, AINet, BINet, and our method. The top row displays results on the BSDS500 dataset; the bottom row shows results on the NYUv2 dataset.
To visually demonstrate the performance of the proposed CDNet across different image domains, we present the qualitative comparison results with three other state-of-the-art methods, including SCN [13], AINet [14], and BINet [15], on the BSDS500 and NYUv2 datasets in Figure 6. In each image, we highlight two distinct regions: flat areas and detailed non-flat areas. It is evident that our model avoids unnecessary segmentation in flat regions while retaining the extraction of small target contours in non-flat regions. Overall, our model clearly produces appropriate segmentation for different image areas based on the image content.
Conclusion
To address the challenges of deep learning-based superpixel segmentation stemming from SLIC’s uniform grid initialization—balancing superpixel compactness in flat regions and boundary adherence in non-flat regions—we propose a content-aware dynamic segmentation network inspired by human visual saliency. Our approach introduces geodesic distance-based seed initialization and dual segmentation heads at different scales for joint training, enabling multi-scale superpixel segmentation within the same image. Experiments on BSDS500 and NYUv2 datasets show that our method achieves state-of-the-art performance with efficient inference.