Introduction
Superpixel segmentation algorithms play a crucial role in transforming the original image into visually significant regions. In this process, they not only group similar pixels at a lower level but also ensure that the resulting superpixels retain independent semantic information. Superpixel segmentation algorithms improve the efficiency of image representation and reduce data dimensionality, making them indispensable preprocessing steps in various computer vision tasks, such as semantic segmentation [1], [2], [3], object detection [4], [5], optical flow estimation [6], [7], and more.
Superpixel segmentation has achieved significant progress over the past two decades. Traditional superpixel segmentation methods are broadly categorized into gradient-based and graph-based approaches. Gradient-based approaches, such as SLIC, offer computational efficiency but struggle with boundary adherence. Other clustering-based methods [8], [9] have improved boundary adherence by employing different distance measurement strategies. On the other hand, graph-based segmentation approaches transform the segmentation problem into energy function minimization, treating image pixels as graph nodes and assigning weights to the edges between nodes. These algorithms generally preserve image boundary information well but result in less compact superpixels [10], [11].
Recently proposed deep learning-based superpixel segmentation methods, such as SSN [12], SCN [13], AINet [14], and BINet [15], have evolved from the SLIC clustering algorithm. These methods still incorporate two main steps: generating initial superpixel seeds through uniform grid initialization, and iteratively learning pixel affiliations within a fixed neighborhood, ultimately assigning class labels to pixels. SSN first replaced the nearest neighbor algorithm with a soft association between pixels and grids, making the algorithm differentiable and suitable for network backpropagation. SCN then used a fully convolutional network to directly predict association scores between image pixels and regular grid cells without the need for iterative clustering, thereby improving efficiency. Subsequent works have extended the end-to-end encoder-decoder approach of SCN. For instance, AINet directly associates high-level grid features with corresponding pixels, providing more effective contextual information. BINet incorporates neural structures and visual mechanisms into the superpixel segmentation network, innovatively mapping the superpixel segmentation network onto the ventral information pathway of the human visual system.
These deep learning-based superpixel segmentation methods face challenges when segmenting objects with varying rates of pixel spatial changes, particularly within the same image. These difficulties stem from the reliance on uniformly initializing superpixels on a regular grid, which involves computing the relationships between each pixel and its fixed neighborhood. Such a seed initialization strategy often leads to the generation of multiple superpixels with similar characteristics in flat regions, resulting in unnecessary segmentation. Furthermore, during the iterative update of superpixel centers, the model might overlook small object features due to the predominance of pixels in flat regions. Thus, it is crucial to adaptively initialize superpixel seeds based on the content of the image.
To address these challenges, we draw inspiration from the saliency-based model of visual attention. Our model generates superpixels with improved boundary adherence in non-flat regions by identifying image regions with different spatial variation rates, while allocating less attention to flat regions, resulting in more compact superpixels there. Specifically, inspired by the visual attention mechanism, we propose a seed initialization strategy based on the geodesic distance transform. This strategy perceives image content by calculating the similarity between pixels within the image. We binarize the geodesic distance-transformed image to segment each image into flat and non-flat regions. We then develop a network with two output heads to train pixel-superpixel associations at two scales, and merge the pixel attribute reconstruction results at the two scales into the loss function. This allows the network to perform joint training under seed initialization strategies at two scales and to output the corresponding association maps.
We summarize the contributions of this work: (1) A novel seed initialization strategy that adapts to image content, utilizing the geodesic distance transform, is proposed. (2) The proposed CDNet includes two segmentation heads, with the innovative integration of scale into the loss function, enabling the network to perform collaborative training on seed initialization grids at two different scales. (3) Inspired by biological vision mechanisms, CDNet is, to our knowledge, the first deep learning-based superpixel segmentation network capable of adaptively segmenting superpixels of different sizes based on image content.
The Proposed Method
A. Seed Initialization Strategy
Drawing inspiration from the visual attention mechanism in human vision, we propose a seed initialization strategy based on geodesic distance transform. The geodesic distance transform can be used to calculate the similarity between two pixels. As shown in Figure 1, despite the Euclidean distances between A and B and between B and C being identical, the geodesic distance between A and B is greater than that between B and C. Unlike Euclidean distance, geodesic distance takes into account obstacles and topological structures in space. Therefore, we use the geodesic distance transform to obtain texture structure information of the image.
Fig. 1. Comparison of geodesic distances of different pixels; dg denotes the geodesic distance and de the Euclidean distance.
Formally, given an image I ∈ ℝ^{H×W×3} defined on a 2D domain Ψ, the geodesic distance [17] between two pixels P_a, P_b ∈ Ψ is defined as:
\begin{equation*}D\left( {{P_a},{P_b}} \right) = \mathop {\min }\limits_{\Gamma \in {\mathcal{P}_{{P_a},{P_b}}}} \int_0^1 {\sqrt {{{\left\| {{\Gamma ^\prime }(s)} \right\|}^2} + {{\left( {\nabla I \cdot \frac{{{\Gamma ^\prime }(s)}}{{\left\| {{\Gamma ^\prime }(s)} \right\|}}} \right)}^2}} } ds\tag{1}\end{equation*}
where Γ denotes a path drawn from 𝒫_{P_a,P_b}, the set of all feasible paths connecting P_a and P_b, parameterized by s ∈ [0, 1]; Γ′(s) is its derivative and ∇I is the image gradient.
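For illustration, the geodesic distance in Eq. (1) can be approximated with a chamfer-style two-pass raster scan. The following Python sketch is a minimal, unoptimized approximation under stated assumptions: the choice of seed mask (the text does not specify what the transform is seeded with) and the unit weighting of the gradient term are placeholders, not the paper's settings.

import numpy as np

def geodesic_distance_map(img_gray, seeds, n_iters=2):
    # Chamfer-style approximation of Eq. (1): each step pays its spatial
    # length plus the intensity change along it (the gradient term).
    # img_gray: (H, W) float array; seeds: (H, W) boolean seed mask (assumed).
    H, W = img_gray.shape
    dist = np.where(seeds, 0.0, np.inf)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]  # causal neighbors
    for _ in range(n_iters):
        for sweep in (1, -1):  # forward, then backward raster scan
            ys = range(H) if sweep == 1 else range(H - 1, -1, -1)
            xs = range(W) if sweep == 1 else range(W - 1, -1, -1)
            for y in ys:
                for x in xs:
                    for dy, dx in offsets:
                        ny, nx = y + sweep * dy, x + sweep * dx
                        if 0 <= ny < H and 0 <= nx < W:
                            step = np.hypot(dy, dx) + abs(
                                img_gray[y, x] - img_gray[ny, nx])
                            if dist[ny, nx] + step < dist[y, x]:
                                dist[y, x] = dist[ny, nx] + step
    return dist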
Then, we use an average threshold to binarize the geodesic distance-transformed image, dividing each image into flat and non-flat regions. This allows for a divide-and-conquer approach to customize superpixel initialization for different image regions. Moreover, to obtain more reliable edges, we filter out some contours through curvature. Subsequently, we discard contours with small areas, as larger superpixel initialization is required for flat regions. Finally, pixels within the retained contours are assigned a value of 1, indicating flat areas with low pixel spatial variation, while the remaining areas represent non-flat regions. The pseudocode for this process is described in Algorithm 1.
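Since Algorithm 1 is not reproduced here, the following Python sketch illustrates the region-splitting step it describes. Several details are assumptions rather than the paper's settings: the threshold direction (flat regions taken as low geodesic distance), the curvature proxy (a perimeter-to-area ratio), and the min_area and max_curvature values.

import cv2
import numpy as np

def flat_region_mask(geo_dist, min_area=500, max_curvature=0.2):
    # Average-threshold binarization; the assumption here is that flat
    # regions correspond to small geodesic distances.
    binary = (geo_dist < geo_dist.mean()).astype(np.uint8)

    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(binary)
    for c in contours:
        area = cv2.contourArea(c)
        if area < min_area:   # discard small contours: flat regions
            continue          # need large superpixel initialization
        perimeter = cv2.arcLength(c, closed=True)
        # Rough curvature proxy: perimeter relative to enclosed area;
        # high values indicate jagged, unreliable edges.
        if perimeter / max(area, 1e-6) > max_curvature:
            continue
        cv2.drawContours(mask, [c], -1, color=1, thickness=-1)
    return mask  # 1 = flat region, 0 = non-flat region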
B. Joint Segmentation Network
In recent years, deep learning-based superpixel segmentation methods, particularly those following the SCN approach, have utilized encoder-decoder networks to predict association maps Q in an end-to-end manner. It is noteworthy that the network structures of these methods cannot accommodate strategies for initializing grids of different sizes.
To address the challenges, we introduce two output head layers after the decoder module for the joint training of association maps at different scales. As shown in Figure 3, the image undergoes geodesic distance transformation and thresholding to derive boundary information, producing a binary mask. Pixels with a mask value of 1 indicate non-flat regions, whereas those with a value of 0 indicate flat regions. Therefore, this mask can be used to guide the network in initializing superpixels with appropriate grid sizes at the corresponding locations. Finally, the feature maps from the decoder module traverse two parallel convolutional layers and output the association maps Q1 and Q2 ∈ ℛH×W×9. The association maps Q1 and Q2 represent the probability of each pixel being associated with its surrounding large-scale grids and small-scale grids, respectively.
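A minimal PyTorch sketch of the two output heads is given below. The decoder channel width in_channels is an assumption; each head predicts a 9-channel association map normalized with a softmax, following the SCN-style formulation.

import torch
import torch.nn as nn

class DualAssociationHead(nn.Module):
    # Two parallel conv heads over the shared decoder features, each
    # predicting a 9-way pixel-to-grid association map at one grid scale.
    def __init__(self, in_channels=16):  # channel width is an assumption
        super().__init__()
        self.head_large = nn.Conv2d(in_channels, 9, kernel_size=3, padding=1)
        self.head_small = nn.Conv2d(in_channels, 9, kernel_size=3, padding=1)

    def forward(self, feats):
        q1 = torch.softmax(self.head_large(feats), dim=1)  # large-scale grid
        q2 = torch.softmax(self.head_small(feats), dim=1)  # small-scale grid
        return q1, q2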
Fig. 2. Our seed initialization strategy vs. other techniques. Left column: others; right column: ours. We initialize two grids of different scales and merge them based on the boundary obtained by the geodesic distance transform.
Due to the absence of superpixel labels, the network is unable to perform supervised learning to train the association map Q. Therefore, Q is utilized as an intermediate variable to reconstruct pixel attributes, such as semantic labels and position vectors, and the loss is calculated between the reconstructed and actual pixel labels. The training phase can be summarized into two essential steps. The first step is to assign pixel-to-superpixel relationships using the association map Q and compute the superpixel center attributes. We take into account the initialization grid size in this step, as follows:
\begin{equation*}{\mathbf{l}}({\mathbf{s}},d) = \frac{{\sum\nolimits_{{\mathbf{p}}:{\mathbf{s}} \in \mathcal{N}_{\mathbf{p}}^d} {\operatorname{avgpool} } \left( {{\mathbf{f}}({\mathbf{p}}) \odot {q_d}({\mathbf{p}},{\mathbf{s}})} \right)}}{{\sum\nolimits_{{\mathbf{p}}:{\mathbf{s}} \in \mathcal{N}_{\mathbf{p}}^d} {{q_d}} ({\mathbf{p}},{\mathbf{s}})}}\tag{2}\end{equation*}
Here, d denotes the initialization grid size, with d = 16 and d = 32 used in our implementation, and s denotes a superpixel.
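A simplified PyTorch sketch of Eq. (2) follows. For brevity it collapses the nine neighbor channels into a single per-pixel weight before pooling, whereas the full method accumulates each pixel's contribution to each of its nine surrounding cells separately; treat it as an illustration of the weighted d × d pooling rather than a faithful reimplementation.

import torch.nn.functional as F

def superpixel_centers(f, q, d):
    # f: (B, C, H, W) pixel attributes; q: (B, 9, H, W) association map.
    # Collapse the 9 neighbor channels into one per-pixel weight (a
    # simplification), then pool the weighted attributes over d x d cells.
    w = q.sum(dim=1, keepdim=True)
    num = F.avg_pool2d(f * w, kernel_size=d)
    den = F.avg_pool2d(w, kernel_size=d)
    return num / (den + 1e-8)  # (B, C, H/d, W/d) center attributes l(s, d)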
Fig. 3. The proposed CDNet adopts a geodesic distance-based seed initialization strategy to learn two-scale pixel-to-superpixel association maps Q and merges them to obtain the final segmentation result.
In the second step, pixel attributes are reconstructed using the superpixel center attributes and the association map, as follows:
\begin{equation*}{{\mathbf{f}}^\prime }({\mathbf{p}},d) = \sum\limits_{{\mathbf{s}} \in \mathcal{N}_{\mathbf{p}}^d} {{{\operatorname{inter} }_{d,{\text{nn}}}}} ({\mathbf{l}}({\mathbf{s}},d)) \cdot {q_d}({\mathbf{p}},{\mathbf{s}})\tag{3}\end{equation*}
Here, inter_{d,nn} denotes nearest-neighbor interpolation with a step size equal to d, and f′(p, d) denotes the pixel attributes reconstructed from the association map.
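Correspondingly, Eq. (3) can be sketched as a nearest-neighbor upsampling of the center attributes followed by association weighting; as with the sketch of Eq. (2), the nine neighbor channels are collapsed here for brevity.

import torch.nn.functional as F

def reconstruct_pixels(centers, q, d):
    # inter_{d,nn}: nearest-neighbor upsampling with step d, then weighting
    # by the (collapsed) associations, mirroring Eq. (3).
    up = F.interpolate(centers, scale_factor=d, mode="nearest")
    return up * q.sum(dim=1, keepdim=True)  # f'(p, d)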
To group pixels with similar attributes, we use cross-entropy loss, and to enhance spatial compactness within superpixels, we apply the ℓ2 norm. The overall network training loss comprises two components: cross-entropy (CE) loss for the semantic label and L2 reconstruction loss for the position vector, where p′ denotes the reconstructed position of pixel p and the weight m/d scales the compactness term with the grid size. The overall loss function ℒ is formulated as follows:
\begin{equation*}\mathcal{L} = \sum\limits_p C E\left( {{{\mathbf{f}}^\prime }({\mathbf{p}},d),{\mathbf{f}}({\mathbf{p}})} \right) + \frac{m}{d}{\left\| {{\mathbf{p}} - {{\mathbf{p}}^\prime }} \right\|_2},\tag{4}\end{equation*}
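A sketch of Eq. (4) in PyTorch is shown below; f is assumed to be a one-hot semantic label map, pos and pos_rec the true and reconstructed position vectors, and the value of the compactness weight m is a placeholder, not the paper's setting.

import torch

def cdnet_loss(f_rec, f, pos_rec, pos, d, m=0.003):
    # Soft cross-entropy between reconstructed and one-hot semantic labels.
    ce = -(f * torch.log(f_rec.clamp(min=1e-8))).sum(dim=1).mean()
    # Scale-aware compactness: L2 distance between true and reconstructed
    # positions, weighted by m/d so larger grids tolerate larger deviations.
    l2 = (m / d) * (pos - pos_rec).norm(p=2, dim=1).mean()
    return ce + l2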
Experiment
A. Datasets and Implementation Details
To evaluate the effectiveness of our approach, experiments were performed on the BSDS500 [18] and NYUv2 [19] datasets. BSDS500 includes 200 training, 100 validation, and 200 test images, yielding 1,087 training, 546 validation, and 1,063 test samples. The NYUv2 dataset, aimed at indoor scene understanding, comprises 1,449 annotated images. Following previous works, unlabeled boundary regions are excluded, and a subset of 400 test images (608×448 pixels) is selected for superpixel evaluation.
For the training phase, input images from the BSDS500 dataset are randomly cropped to 224×224 pixels. The initial learning rate is set to 8e-5, and it is halved after 8,000 iterations. Using the Adam optimizer, the model is trained for 200 epochs with a batch size of 8. All experiments are implemented within the PyTorch framework on a workstation equipped with an Intel Core i7 CPU and an RTX 2080 GPU.
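For reference, the optimization setup above corresponds to the following PyTorch configuration, reading "halved after 8,000 iterations" as a step decay every 8,000 iterations; model stands in for the CDNet instance.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=8e-5)
# Halve the learning rate every 8,000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8000,
                                            gamma=0.5)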
B. Comparison with the State-Of-The-Arts
We assess the performance of superpixel methods using three metrics: achievable segmentation accuracy (ASA), boundary recall and precision (BR-BP), and compactness (CO). The ASA score measures the upper limit of segmentation accuracy achievable with superpixels as a pre-processing step, whereas BR and BP evaluate the superpixel model’s ability to identify semantic boundaries effectively. CO assesses the compactness of the superpixels. Therefore, the BR-BP and CO metrics can effectively evaluate the adherence of superpixels in non-flat regions and the compactness of superpixels in flat regions. A comprehensive performance comparison on BSDS500 and NYUv2 datasets is illustrated in Figures 4-5.
(1) Results on BSDS500
As shown in Figure 4, we compare CDNet with other superpixel segmentation methods on the BSDS500 dataset using metrics such as ASA, BR-BP, and CO. The results demonstrate that CDNet outperforms other methods, particularly when the number of superpixels is large. This is likely because methods based on uniform superpixel initialization strategies tend to produce unnecessary segmentation in flat regions when the superpixel count is high. Although CDNet's performance in segmentation accuracy is slightly inferior when the number of superpixels is low, it still remains comparable to other deep learning-based methods such as AINet. Overall, the proposed CDNet, which employs a divide-and-conquer approach based on image content, is capable of generating appropriate superpixel segmentation results for both flat and non-flat regions.
(2) Results on NYUv2
Figure 5 depicts the comparison results of CDNet on the NYUv2 dataset without any fine-tuning. Similarly, the proposed CDNet outperforms other superpixel segmentation algorithms in all three metrics. The performance on NYUv2 shows no significant correlation with the number of superpixels. We attribute this to the NYUv2 dataset being primarily used for indoor scene understanding, where the presence of more diverse objects results in larger non-flat regions.
Fig. 6. Qualitative results of four state-of-the-art superpixel methods: SCN, AINet, BINet, and our method. The top row displays results on the BSDS500 dataset; the bottom row shows results on the NYUv2 dataset.
To visually demonstrate the performance of the proposed CDNet across different image domains, we present the qualitative comparison results with three other state-of-the-art methods, including SCN [13], AINet [14], and BINet [15], on the BSDS500 and NYUv2 datasets in Figure 6. In each image, we highlight two distinct regions: flat areas and detailed non-flat areas. It is evident that our model avoids unnecessary segmentation in flat regions while retaining the extraction of small target contours in non-flat regions. Overall, our model clearly produces appropriate segmentation for different image areas based on the image content.
Conclusion
To address the challenges of deep learning-based superpixel segmentation stemming from SLIC’s uniform grid initialization—balancing superpixel compactness in flat regions and boundary adherence in non-flat regions—we propose a content-aware dynamic segmentation network inspired by human visual saliency. Our approach introduces geodesic distance-based seed initialization and dual segmentation heads at different scales for joint training, enabling multi-scale superpixel segmentation within the same image. Experiments on BSDS500 and NYUv2 datasets show that our method achieves state-of-the-art performance with efficient inference.