Introduction
Computer vision involves extracting information from visual data and has numerous applications in diverse fields such as autonomous driving [1], [2], artificial intelligence in physical systems [3], [4], and interpretation of visual content [5], [6], [7], [8]. Object segmentation, a fundamental task in computer vision, involves identifying and delineating objects within an image [9], [10], [11]. Accurate object segmentation is crucial for enabling machines to perceive and interpret their surroundings. Traditional instance segmentation methods have made significant strides in segmenting the visible parts of objects [12], [13], but they often fall short when it comes to handling occlusion and inferring the complete shapes of objects [14], [15].
Amodal instance segmentation is a promising solution to overcome the limitations of traditional methods [16]. Amodal segmentation refers to segmenting objects in a visual scene based not only on their visible parts but also by inferring their complete shapes, including occluded or partially visible regions. The concept recognizes that objects in the real world exist in three-dimensional space and often interact with their surroundings, resulting in occlusion or partial visibility. In contrast to conventional instance segmentation methods that concentrate solely on visible regions, the objective of amodal instance segmentation is to detect and segment both the visible and occluded components of objects [17], [18]. This approach goes beyond traditional techniques by providing a comprehensive understanding of object shapes, even in challenging scenarios where objects are partially occluded or only partially visible. By incorporating information about occluded regions, amodal instance segmentation offers a more holistic perception of the scene [19] and a better understanding of object shapes and boundaries, which is vital for computer vision applications [20], [21].
Achieving accuracy in amodal instance segmentation tasks, however, poses significant challenges. The presence of occlusion introduces complexities that traditional methods struggle to handle effectively. Occluded regions may contain crucial information for understanding the context, relationships between objects, and accurate object boundaries [22]. Therefore, there is a growing need for advanced techniques that can robustly handle occlusion and provide precise amodal instance segmentation [23].
The importance of amodal segmentation lies in its ability to provide the exact dimensions of occluded objects, enabling a more comprehensive understanding of the scene and improving object localization accuracy. It offers a higher level of granularity in perceiving moving objects for Advanced Driver Assistance Systems (ADAS) tasks, leading to enhanced safety and efficiency in autonomous vehicles. It is also crucial in enhancing robotic perception, enabling robots to interact more effectively with their environments and perform complex tasks with improved object recognition and tracking capabilities.
Our paper introduces an innovative framework and methodology for amodal instance segmentation. The proposed architecture and technique present a novel approach to accurately detecting and segmenting both visible and occluded parts of objects. Our approach addresses the limitations of traditional methods by leveraging a backbone layer, a feature pyramid network (FPN), and an amodal segmentation head. The proposed architecture integrates both fine-grained local details and broader global contextual information, effectively handles occlusion, and infers complete object shapes. By incorporating occluded regions into the segmentation process, our technique offers a more comprehensive understanding of objects in complex scenes. Through extensive experiments, we substantiate the superiority of the proposed architecture over existing methods. In the remainder of this paper, we present a detailed overview of related work in the field, describe the proposed architecture and technique, present experimental results and analysis, and conclude with future research directions. By addressing the challenges associated with occlusion and providing robust amodal instance segmentation, we believe our research contributes to advancing the field of computer vision and its practical applications.
The key contributions of our paper can be summarized as follows:
Introducing ASH-Net, a novel single-stage architecture that achieves reduced computational cost and faster inference time, while maintaining enhanced generality and robustness.
Conducting extensive benchmarking on diverse datasets, such as KINS, COCOA-CLS, and D2SA, demonstrating superior accuracy levels compared to existing approaches.
Performing additional experiments to explore various augmentations and loss functions, resulting in improved accuracy and further validating the effectiveness of our proposed model.
Related Work
The related work section aims to provide a comprehensive survey of the existing research and advancements in the field of amodal segmentation. This section presents summaries and analyses of key papers that have contributed significantly to the understanding and development of amodal segmentation techniques. Table 1 complements the text by highlighting the key comparative features of the models and metrics utilized across the papers surveyed.
Back et al. [24] address the problem of completely occluded objects in unstructured environments, which requires detecting visible and amodal masks alongside the occlusions of concealed object instances. The authors introduce a novel modeling scheme that assigns a hierarchy to feature fusion and prediction order; this hierarchy enhances the model's ability to handle occlusion and improves overall performance. The proposed UOAIS-Net is trained on synthetic RGB-D images and evaluated on three benchmarks, demonstrating effective amodal instance segmentation and occlusion modelling for unseen objects in cluttered scenes. Two datasets (UOAIS-Sim and OSD-Amodal) are introduced in this paper. The authors report occasional over-segmentation of occluded objects and highlight the need for additional algorithms for amodal perception.
In their paper, Xiao et al. [25] introduce a novel approach for amodal segmentation, which involves inferring both the visible and invisible regions of an object’s mask. Unlike existing methods, their technique leverages the concept of amodal perception, where human observers can infer the complete object based on partial visibility. The authors address the challenges of simulating amodal perception by proposing a three-module framework: coarse, visible and amodal mask segmentation. By iteratively refining the visible mask based on the information from the amodal mask and leveraging shape priors, their method achieves improved accuracy in the segmentation task. Experimental evaluations on multiple datasets demonstrate the superiority of their approach compared to state-of-the-art methods. They have also performed visualizations on items such as carrot, bottle, and cucumber to highlight the interpretability and effectiveness of the category-specific shape priors.
Reddy et al. [26] introduced a new approach for learning amodal representations from time-lapse imagery. To facilitate their research, the authors created the WALT dataset, which consists of data captured by 12 cameras over a year. They demonstrated that combining object boundaries with segmentation masks further improves the accuracy, applicability, and efficiency of amodal segmentation across diverse camera setups and scene contexts. Their method showed robustness to occlusions and outperformed previous techniques, especially in scenarios with severe occlusions. The paper emphasized the potential limitations of generalizing amodal segmentation to new cameras that exhibit substantially different scenes and underscored the need for further investigation into strategies for accelerating learning to address this challenge effectively.
Ke et al. [27] designed BCNet, which merges an occlusion perception branch and a bilayer GCN structure to perform amodal segmentation. The approach explicitly models occluding regions and the interactions between objects, leading to significant improvements in segmentation performance. Unlike methods that neglect occluding instances and overlapping relations, BCNet incorporates an occlusion perception branch parallel to the target prediction pipeline and a bilayer graph convolutional structure to explicitly model occluder-occludee relations and account for interactions between objects. BCNet exhibited superior occlusion handling on highly overlapping images, surpassing Mask Scoring R-CNN and benefiting from training on a synthetic occlusion dataset. Qualitative evaluations demonstrate BCNet's effectiveness in decoupling occluding and occluded instances and accurately regressing contours and masks.
Liu [28] introduces a novel approach that extends amodal instance segmentation with segmentation tracking without boundaries. The work identifies three issues with current models: difficulty detecting large-scale objects, failure to capture meaningful embeddings for shapes with similar appearances, and ineffective use of the Kalman filter when objects bounce back into the scene. The proposed model shows promise in accurately detecting masks for shapes and preserving their identities even when they are barely visible, but it faces challenges in distinguishing heavily overlapping shapes and requires improvements in handling complete occlusion and in capturing more complex features in its embeddings.
Sun et al. presented a Bayesian approach [30] for amodal instance segmentation, promising a more data-efficient and robust computer vision model that addresses out-of-task and out-of-distribution challenges. They claim to outperform existing weakly and fully supervised techniques in scenarios with high occlusion levels. The contributions of this work include formulating amodal segmentation as a generalization problem, developing a Bayesian generative model trained on labelled bounding boxes, and achieving superior performance on various datasets. The paper introduced shape priors to boost the precision of amodal segmentation. The results demonstrated the effectiveness of the proposed approach both quantitatively and qualitatively, showcasing its ability to accurately estimate amodal segmentation and preserve object shape consistency even in the presence of occluders. An evaluation of model transferability showed robust performance when applied to previously unseen occluders. The authors recommend incorporating 3D shape priors for better representation of non-rigid objects.
Zhang et al. [31] introduced a novel approach that computes semantics-aware distance maps for every detected object in an image. These maps provide pixel-level information about the object's modal and amodal masks along with their comparative depth ordering. The semantic layering network employs a convolutional neural network architecture consisting of multiple modules to compute these maps. The results indicate that the semantic distance map representation improves the network's ability to perceive amodal concepts. However, the model faces challenges in generating high-quality object masks; using pixel-level proposal frameworks may enhance its ability to produce amodal segmentation masks.
Tran et al. introduce the AISFormer [32] framework, which places a transformer-based detection head on a convolutional backbone. The transformer-based head explicitly computes similarity between the multiple masks predicted within a single object's region of interest. The feature encoding module captures and encodes relevant features from the input data; the mask transformer decoder module generates refined masks based on the encoded features; the invisible mask embedding module incorporates occlusion information into the segmentation process; and the segmentation module performs the final segmentation of the objects of interest. Collectively, these modules form a cohesive framework that enables accurate and robust object segmentation. Ablation studies validated the efficacy of the proposed learnable queries and invisible mask embedding. The model shows potential for integrating shape priors and exploring other modalities such as time-lapse imagery and video.
Nguyen et al. developed ASBU [33], a weakly supervised model that estimates boundary uncertainty for amodal segmentation. Building on a UNet for amodal instance segmentation, the work first addressed a limitation of previous approaches by replacing the mask with the occlusion boundary as the input to the UNet. This accommodated partial occlusion and reduced the complexity of approximating the pseudo-ground-truth mask, eliminating the need to generate an object ordering graph and saving computational resources. Secondly, shape priors were estimated by computing an uncertainty map for the predicted masks. The uncertainty map helped regularize learning by modulating the intersection between the predicted mask and the ground-truth mask of the occluded object based on regions of high uncertainty. The model's pseudo amodal masks performed comparably to a Mask R-CNN trained on ground-truth amodal masks.
Mohan et al. [34] investigate amodal panoptic segmentation, addressing the need for comprehensive scene recognition that involves segmenting both visible and occluded object regions. The paper extends two urban driving datasets, KITTI-360-APS [35] and BDD100K-APS [36], by providing amodal panoptic segmentation labels, thereby creating benchmark datasets for evaluation. The authors propose the APSNet architecture, which has two instance segmentation heads, one task-specific and one amodal, connected to a shared backbone; a fusion module generates the panoptic segmentation output. The amodal instance head incorporates visible and occlusion features to predict the amodal mask of each object and enhance occlusion awareness. Custom evaluation metrics are introduced to quantify amodal segmentation performance for different object classes. The approach performs well on amodal scene parsing and addresses boundary segmentation in occluded regions.
Methodology
This section presents the methodology employed in our research, focusing on the development of a novel architecture. To address the amodal segmentation problem, we have designed a novel architecture (represented in Figure 1) that leverages the strengths of well-established deep learning architectures while incorporating novel design choices. The entire methodology is summarized in Algorithm 1; an illustrative training-loop sketch follows the algorithm.
Figure 1. Architecture of the proposed ASH-Net with a ResNet-50 backbone, FPN layer, and amodal segmentation head.
Algorithm 1 ASH-Net Object Detection and Segmentation

Input: object detection and segmentation dataloader D with ground-truth labels y; ASH-Net model initialized with weights w; total number of epochs E; stochastic gradient descent optimizer O
Output: final bounding boxes and segmentation masks from the model

for epoch = 0 to E do
    for each batch (I, y) in D do
        // use the ResNet-50 backbone to extract the feature maps F
        F ← R50_backbone(I)
        // pass the features to the FPN, which produces fine-grained multi-scale features Fi
        Fi ← FPN(F)
        // pass the P3-level FPN features to the proto head PH, which predicts a set of k prototype masks M for the entire image
        M ← PH(P3)
        // pass the FPN features to the prediction module PM, which predicts c class confidences, 4 bounding-box regressors, and k mask coefficients, one corresponding to each prototype
        C, B, Coef ← PM(Fi)
        // calculate the cross-entropy loss Lcel, the box loss Lbox, and the mask loss Lbce
        // apply the optimizer O to calculate the gradients and update the weights w
    end for
end for
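To make the algorithm concrete, the following is a minimal PyTorch-style sketch of the training loop. The module names (backbone, fpn, proto_head, pred_head) and loss helpers are hypothetical placeholders with the interfaces described in Algorithm 1; this is an illustrative sketch under those assumptions, not the authors' exact implementation.

```python
import torch

def train_ashnet(model, dataloader, num_epochs, lr=1e-3):
    """Illustrative training loop following Algorithm 1 (sketch only)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(num_epochs):
        for images, targets in dataloader:            # batch of images I and ground-truth labels y
            feats = model.backbone(images)            # ResNet-50 feature maps F
            pyramid = model.fpn(feats)                # multi-scale FPN features Fi
            protos = model.proto_head(pyramid[0])     # k prototype masks M from the P3 level
            cls, boxes, coefs = model.pred_head(pyramid)  # class confidences, box regressors, mask coefficients

            loss_cls = model.classification_loss(cls, targets)
            loss_box = model.box_loss(boxes, targets)
            loss_mask = model.mask_loss(protos, coefs, targets)
            loss = loss_cls + loss_box + loss_mask    # L_ASHNet = L_cel + L_box + L_bce

            optimizer.zero_grad()
            loss.backward()                           # compute the gradients
            optimizer.step()                          # apply the SGD update
```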
A. Architecture
1) ResNet Backbone
The initial step in our methodology involves employing the ResNet-50 architecture [37] as the backbone layer to process the input image and extract meaningful features. This architecture is chosen for its depth and performance, making it a popular choice in various computer vision tasks. Residual networks incorporate skip connections, also known as residual connections, which mitigate the vanishing gradient problem and enable the effective flow of gradients when training deep networks. These connections allow the model to learn identity mappings, ensuring that higher layers can at least match the performance of lower layers, which promotes the overall stability and convergence of the network.
By leveraging skip connections, the model can capture and preserve fine-grained details across multiple layers, allowing for a more comprehensive representation of the input image. This ability is particularly important in object segmentation tasks, where precise localization and understanding of object boundaries are crucial.
The depth of the architecture enables it to learn hierarchical representations of visual features, progressively extracting more abstract and high-level features as the network goes deeper. This hierarchical feature extraction process is beneficial for object segmentation, as it enables the model to capture both local details and global contextual information. By examining features at different levels of abstraction, the network can effectively differentiate between different objects and accurately segment them.
The features generated by the backbone serve as a rich representation of the input image, capturing important visual cues necessary for subsequent stages of the amodal instance segmentation process. These features form the foundation for further processing and refinement, such as the integration of multi-scale features in the subsequent layers, to achieve more precise and comprehensive object segmentation results.
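As a concrete illustration, a ResNet-50 backbone can expose the intermediate feature maps that feed the later pyramid stage. The sketch below uses torchvision's feature-extraction utility (available in recent torchvision releases); the chosen layers and the 550x550 input size are assumptions for the example, not the paper's exact configuration.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Expose the outputs of layer2-layer4 (often called C3, C4, C5),
# the multi-scale maps typically fed to an FPN.
backbone = resnet50(weights=None)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer2": "c3", "layer3": "c4", "layer4": "c5"},
)

x = torch.randn(1, 3, 550, 550)       # example input resolution (assumption)
feats = extractor(x)
for name, f in feats.items():
    print(name, tuple(f.shape))        # c3: 512 ch, c4: 1024 ch, c5: 2048 ch at decreasing resolution
```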
2) FPN
After obtaining the features at different scales from the backbone layer, the next step in our methodology involves utilizing the Feature Pyramid Network (FPN) [38] to generate feature maps with consistent channel dimensions but varying spatial resolutions. This layer, specifically designed as a feature extractor for pyramid-based representations, prioritizes both accuracy and speed. It enhances the feature extraction process by generating several feature map layers that carry more comprehensive and informative data than the raw backbone maps, enabling the subsequent stages of our amodal instance segmentation approach to make more precise and context-aware predictions. This is achieved by employing a combination of up-sampling and lateral connections. The feature maps from the higher levels of the pyramid, which are spatially coarser but semantically stronger, are up-sampled to match the resolution of the spatially finer maps below them. The up-sampling step scales the spatial resolution by a factor of 2 using the nearest-neighbour method.
As depicted in the architecture, this layer also establishes lateral connections to the backbone layer. This is performed to merge feature maps of the same spatial size from the backbone and the upper layers of the FPN. To ensure compatibility of the feature maps for merging, those from the backbone pathway undergo a 1x1 convolution that aligns their channel dimensions with the pyramid features before they are combined with the up-sampled maps through element-wise addition.
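The merging scheme described above (2x nearest-neighbour up-sampling plus 1x1 lateral convolutions followed by element-wise addition) can be sketched as follows. Channel counts, level names, and spatial sizes are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal top-down FPN sketch: lateral 1x1 convs + 2x nearest-neighbour up-sampling."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # 3x3 convolutions reduce aliasing introduced by the merge
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)

fpn = SimpleFPN()
c3 = torch.randn(1, 512, 64, 64)    # example backbone maps at strides 8, 16, 32
c4 = torch.randn(1, 1024, 32, 32)
c5 = torch.randn(1, 2048, 16, 16)
p3, p4, p5 = fpn(c3, c4, c5)        # all pyramid levels share 256 channels
```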
3) Amodal Segmentation Head
The next layer is the Amodal Segmentation Head (ASH), which is a crucial, novel component of our architecture that facilitates the generation of accurate and comprehensive amodal instance segmentation results. It consists of two main branches: the prototype generation branch (protos) and the detection branch, each playing a specific role in the overall prediction process.
The prototype generation branch (protos) is responsible for predicting a set of prototype masks that span the entire image. These prototype masks serve as foundational representations of the object instances present in the scene. To achieve this, the protos branch is implemented as a Fully Convolutional Network (FCN) whose final layer comprises k channels, one per prototype. The FCN is attached to a specific feature layer extracted from the backbone network. By leveraging the protos branch, our architecture is able to generate prototype masks that capture the essential shape and structure of objects, including occluded or partially visible regions. These prototype masks provide a valuable foundation for the subsequent stages of the amodal instance segmentation process, enabling accurate inference of complete object shapes and boundaries.
The detection branch has an extra head that produces, for every anchor, a vector of mask coefficients. These coefficients encode how the prototype masks are combined to represent each instance. By leveraging this anchor-based approach, our architecture can precisely associate detected objects with their corresponding prototype masks, facilitating the extraction of detailed and contextually aware amodal segmentations. In addition, the detection branch contains two prediction heads. The first prediction head estimates class confidences, providing a measure of the likelihood that an anchor corresponds to a particular object instance; this aids in accurately identifying and classifying objects within the scene. The second prediction head predicts four bounding-box regressors, providing precise localization information that allows accurate delineation of object boundaries and facilitates more accurate amodal instance segmentation results. By employing these prediction heads within the detection branch, our architecture achieves a complete assessment of objects within the scene; the combination of confidence estimation and precise bounding-box regression enhances the overall accuracy and reliability of the amodal instance segmentation process.
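To make the two branches concrete, the sketch below shows a hypothetical prototype branch (a small FCN producing k prototype masks) and per-anchor heads for class confidences, box regressors, and mask coefficients; an instance mask is then assembled as a coefficient-weighted combination of the prototypes. Layer sizes, k, the number of anchors, and the tanh activation on the coefficients are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ProtoBranch(nn.Module):
    """FCN that predicts k prototype masks spanning the whole image."""
    def __init__(self, in_channels=256, k=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, k, 1),
        )

    def forward(self, p3):
        return self.net(p3)                      # (B, k, H, W) prototype masks

class DetectionBranch(nn.Module):
    """Per-anchor heads: c class confidences, 4 box regressors, k mask coefficients."""
    def __init__(self, in_channels=256, num_classes=80, num_anchors=3, k=32):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.box = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)
        self.coef = nn.Conv2d(in_channels, num_anchors * k, 3, padding=1)

    def forward(self, feat):
        return self.cls(feat), self.box(feat), torch.tanh(self.coef(feat))

def assemble_mask(protos, coefs):
    """One instance mask as a sigmoid of the coefficient-weighted prototype sum.

    protos: (k, H, W) prototype masks, coefs: (k,) coefficients for one detection.
    """
    return torch.sigmoid((coefs[:, None, None] * protos).sum(dim=0))
```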
B. Loss Function
We have used a cumulative loss function in our amodal segmentation model, as combining multiple individual loss terms integrates various objectives and trade-offs into a unified optimization criterion. This approach helps improve the robustness, regularization, and overall performance of the segmentation model. The loss function of the proposed model is \begin{equation*} \text {L}_{\mathrm {ASHNet}}=\mathrm {L}_{\mathrm {cel}}+\mathrm {L}_{\mathrm {box}}+\mathrm {L}_{\mathrm {bce}} \tag{1}\end{equation*}
The classification term is the cross-entropy loss \begin{equation*} \mathrm {L}_{\mathrm {cel}}=\frac {-1}{\mathrm {N}}\sum \limits _{\mathrm {i}=1}^{\mathrm {N}} \sum \limits _{\mathrm {j}=1}^{\mathrm {M}} {\mathrm {y}_{\mathrm {ij}}\log \left ({\mathrm {p}_{\mathrm {ij}} }\right)} \tag{2}\end{equation*} where $y_{ij}$ indicates whether sample $i$ belongs to class $j$ and $p_{ij}$ is the predicted probability for that class.
The cross-entropy loss is well suited here for several reasons: it encourages the model to assign high probabilities to the correct classes and heavily penalizes confident but incorrect predictions, driving accurate classification. The bounding-box term compares the predicted box parameters $b_{ij}$ with the ground-truth parameters $g_{ij}$ using a smooth L1 penalty \begin{equation*} \mathrm {L}_{\mathrm {box}}=\sum \limits _{i=1}^{N} \sum \limits _{j=1}^{\mathrm {M}} SmoothL1\left ({b_{ij}-g_{ij}}\right) \tag{3}\end{equation*}
where \begin{align*} SmoothL1\left ({x }\right)=\begin{cases} 0.5x^{2},& \text {if } \left |{ x }\right | < 1 \\ \left |{ x }\right |-0.5, & \text {otherwise} \end{cases}\end{align*}
The mask term is the binary cross-entropy between the predicted mask values $m_{ij}$ and the ground-truth labels $y_{ij}$ \begin{equation*} \mathrm {L}_{\mathrm {bce}}=\frac {-1}{\mathrm {N}}\sum \limits _{\mathrm {i}=1}^{\mathrm {N}} \sum \limits _{\mathrm {j}=1}^{\mathrm {M}} \left [{\mathrm {y}_{\mathrm {ij}}\log \left ({\mathrm {m}_{\mathrm {ij}} }\right)+\left ({1-\mathrm {y}_{\mathrm {ij}} }\right)\log \left ({1-\mathrm {m}_{\mathrm {ij}} }\right)}\right ] \tag{4}\end{equation*}
It provides a probabilistic interpretation of the predicted mask values. The loss encourages the model to output values close to 1 for pixels that belong to the object instance and values close to 0 for pixels outside the object. This helps in generating accurate and sharp masks that delineate the object boundaries effectively. As it enables efficient optimization through gradient computation, suits the binary classification nature of the task, and encourages the model to produce accurate pixel-level masks by penalizing errors based on their probabilities, it is our component of choice.
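The combined objective of Eq. (1)-(4) can be sketched as follows. The tensor shapes, the random example inputs, and the reduction choices are assumptions made for illustration; weighting and normalization details are simplified relative to the paper.

```python
import torch
import torch.nn.functional as F

def ashnet_loss(cls_logits, cls_targets, box_preds, box_targets, mask_preds, mask_targets):
    """L_ASHNet = L_cel + L_box + L_bce (illustrative sketch)."""
    # Eq. (2): cross-entropy over class confidences
    l_cel = F.cross_entropy(cls_logits, cls_targets)
    # Eq. (3): smooth L1 between predicted and ground-truth box parameters
    l_box = F.smooth_l1_loss(box_preds, box_targets, reduction="sum")
    # Eq. (4): binary cross-entropy over predicted mask probabilities
    l_bce = F.binary_cross_entropy(mask_preds, mask_targets)
    return l_cel + l_box + l_bce

# Example call with random tensors (shapes are assumptions)
cls_logits = torch.randn(8, 80)                  # 8 anchors, 80 classes
cls_targets = torch.randint(0, 80, (8,))
box_preds, box_targets = torch.randn(8, 4), torch.randn(8, 4)
mask_preds = torch.rand(8, 138, 138)             # probabilities in [0, 1]
mask_targets = torch.randint(0, 2, (8, 138, 138)).float()
loss = ashnet_loss(cls_logits, cls_targets, box_preds, box_targets, mask_preds, mask_targets)
```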
Experiments
A. Datasets
The performance of the model in amodal segmentation is evaluated on three separate datasets: the D2SA dataset [39], the KINS dataset [16], and the COCOA-cls dataset [40].
1) D2SA
The D2SA dataset is an extension of the D2S (Densely Segmented Supermarket) dataset, encompassing 60 object categories. It comprises a total of 2000 training images and 3600 validation images.
2) KINS
The KINS dataset is built on top of the KITTI dataset. It consists of a substantial collection of 7474 training images and 7517 validation images of varied dimensions.
3) COCOA-CLS
The COCOA-cls dataset includes 2476 training images and 1223 validation images. It covers a wide spectrum of 80 categories, providing a diverse collection of objects and scenes for analysis and evaluation.
These datasets serve as valuable benchmarks for evaluating the performance of our amodal segmentation model across different contexts and domains.
B. Augmentation
Augmentation techniques play a crucial role in improving the performance of the segmentation model by providing a more diverse and representative training dataset. By incorporating variations in color, orientation, scale, and geometric transformations, these techniques enhance the model's ability to generalize and accurately segment objects under different conditions. We applied the following augmentation techniques to all three datasets (a minimal pipeline sketch follows the list):
Photometric Distort: This technique applies random changes to the color and contrast of the image. It helps in making the model more robust to variations in lighting conditions and enhances the model’s ability to handle diverse color distributions. By introducing variations in color, brightness, and contrast, photometric distortion can improve the model’s ability to segment objects accurately under different lighting conditions.
Random Mirror: This technique randomly flips the input image horizontally. By augmenting the dataset with mirrored versions of objects, it improves the model’s ability to generalize to different orientations. It helps the model learn to recognize objects regardless of their left-right orientation, making the segmentation predictions more consistent and robust.
Random Crop: This technique randomly selects a crop from the input image. By simulating different object scales and aspect ratios, it improves the ability of the model to handle various sized objects. The model learns object segmentations at different resolutions, making it more versatile in capturing objects with varying spatial extents.
RandAugment: This technique applies a random combination of image transformations to the input image. It increases the diversity of the input, allowing the model to acquire robust features and generalize better to different scenarios. It can handle geometric variations such as rotation, translation, and scaling, making the model more robust to object deformations and occlusions. By introducing diverse variations during training, RandAugment improves the ability of the model to segment objects accurately in real-world scenarios.
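As an illustration, the four augmentations above can be composed with torchvision. The parameter values and the 550-pixel crop size below are assumptions for the sketch rather than the exact training settings; for segmentation training, the geometric transforms must be applied jointly to images and masks, whereas this sketch shows only the image-side form.

```python
import torchvision.transforms as T

# Illustrative image-side pipeline covering the four augmentations described above.
train_transforms = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),  # photometric distortion
    T.RandomHorizontalFlip(p=0.5),                                          # random mirror
    T.RandomResizedCrop(size=550, scale=(0.5, 1.0)),                        # random crop / rescale
    T.RandAugment(num_ops=2, magnitude=9),                                  # RandAugment policy
    T.ToTensor(),
])
```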
C. Evaluation Metrics
To evaluate the performance of our amodal segmentation model, we employ two key metrics: average precision (AP) and average recall (AR) [17], [31]. These metrics provide valuable insights into the model’s accuracy and completeness in detecting and segmenting objects.
AP measures the precision of object detection and segmentation by considering the overlap between predicted and ground truth masks. It takes into account the precision at different overlap thresholds, such as IoU (Intersection over Union). A higher average precision indicates better localization and segmentation accuracy.
AR, on the other hand, evaluates the model’s ability to recall and detect objects across different scales and categories. It measures the percentage of correctly detected objects compared to the total number of ground truth objects. A higher average recall indicates better object detection performance.
By utilizing both average precision and average recall, we gain a comprehensive understanding of the model’s performance in terms of both accuracy and recall. These metrics allow us to assess the model’s capability to accurately localize and segment objects while maintaining a high recall rate, which is crucial for tasks such as object recognition, scene understanding, and autonomous driving.
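For clarity, the sketch below shows mask IoU and a simple precision/recall computation at a single IoU threshold. Full AP/AR evaluation in practice averages over multiple thresholds and categories (typically with COCO-style tooling); the greedy matching and the 0.5 threshold here are only illustrative assumptions.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two binary masks (H, W) given as 0/1 numpy arrays."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def precision_recall(pred_masks, gt_masks, iou_thresh=0.5):
    """Greedy one-to-one matching of predictions to ground truths at one IoU threshold."""
    matched_gt = set()
    tp = 0
    for p in pred_masks:
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gt_masks):
            if j in matched_gt:
                continue
            iou = mask_iou(p, g)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thresh:
            tp += 1
            matched_gt.add(best_j)
    precision = tp / len(pred_masks) if pred_masks else 0.0
    recall = tp / len(gt_masks) if gt_masks else 0.0
    return precision, recall
```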
Results and Discussion
In this section, we present the results of amodal segmentation using ASH-Net on three diverse datasets: COCOA-cls, KINS, and D2SA. We compare the performance of ASH-Net with other state-of-the-art architectures, including PCNet, ASBU, VRSP-Net, and Mask R-CNN. The input sizes and segmentation performance are tabulated for each dataset to facilitate a comprehensive comparison of precision and recall metrics.
ASH-Net uses a ResNet-50 backbone; the input size used for each dataset is reported alongside the corresponding segmentation results in the tables below.
Table 2 presents the amodal segmentation performance comparison of ASH-Net on the COCOA-cls dataset. The results reveal that our model achieves the highest precision and recall scores when compared to other architectures reported in the literature. These findings demonstrate the efficacy of ASH-Net in accurately delineating amodal boundaries. Figure 2 showcases the segmentation results obtained by ASH-Net on the COCOA-cls dataset, providing a visual representation of its performance.
Moving on to the KINS dataset, Table 3 illustrates the amodal segmentation performance comparison of ASH-Net with other state-of-the-art architectures. Consistently, ASH-Net outperforms the competing models in terms of both precision and recall metrics, showcasing its superior segmentation capabilities. Figure 3 further exemplifies the effectiveness of ASH-Net by visualizing the segmentation results achieved on the KINS dataset.
Table 4 presents the amodal segmentation performance comparison of ASH-Net on the D2SA dataset. Again our model attains the highest precision and recall scores, indicating its superior performance compared to other architectures. Figure 4 showcases the amodal segmentation results achieved by ASH-Net on the D2SA dataset, providing a visual confirmation of its proficiency.
Collectively, these results highlight the superior performance of ASH-Net in amodal segmentation across multiple datasets. Its robustness, as evidenced by the consistently high precision and recall metrics, underscores its efficacy in accurately capturing and delineating object boundaries.
Table 5 depicts the various choices for coef-dim during the training of the model on the KINS, D2SA, and COCOA-Cls datasets. After careful analysis, a coef-dim of 32 was selected due to its optimal performance and computational speed. Interestingly, comparing the metrics across different coef-dim choices, we observed no significant difference between coef-dims 32 and 64.
Table 6 presents the different choices for aspect ratios during model training on the KINS, D2SA, and COCOA-Cls datasets. Three aspect ratios were chosen based on considerations of performance and computational efficiency. Notably, our analysis revealed that there is no substantial difference in the metrics between aspect ratios 3 and 5. As a result, an aspect ratio of 3 was selected to reduce the model’s complexity and enhance training efficiency.
Table 7 provides a comprehensive component comparison based on augmentation techniques and loss functionalities, considering the KINS, D2SA, and COCOA-Cls datasets. Our findings indicate that incorporating additional augmentation techniques and utilizing specific loss functions led to improved segmentation performance. These results highlight the importance of employing advanced techniques during the training process to enhance the model’s performance.
In the course of this research, certain implementation challenges were encountered that were essential to address for the successful development of our proposed model. Firstly, we aimed to design a real-time capable single-stage architecture that could effectively tackle the amodal segmentation problem while ensuring efficient inference without compromising accuracy. Secondly, we faced the challenge of training the model with fine-tuned hyperparameters using a limited dataset, which required careful experimentation and validation to achieve optimal performance. Lastly, the dataset exhibited class imbalance, posing further complexities in obtaining unbiased predictions. To overcome this, we employed suitable techniques to address the class imbalance issues and enhance the overall model performance.
Conclusion
Our work has yielded significant findings and made notable contributions to the field of amodal segmentation. Through extensive experimentation and evaluation, we have demonstrated the exceptional performance of our proposed ASH-Net architecture compared to existing state-of-the-art models. ASH-Net consistently outperforms these models, achieving higher precision, recall, and other evaluation metrics for amodal segmentation tasks.
Our findings emphasize the significance and potential applications of our proposed amodal instance segmentation method. Accurately estimating object boundaries beyond occlusion is crucial in various real-world scenarios. Even in challenging scenarios, our proposed method can enhance object detection, tracking, and scene understanding, leading to advancements in these fields. In our work, we have identified optimal training parameters that enhance the segmentation performance of ASH-Net without compromising computational efficiency. This knowledge contributes to the efficient training of amodal segmentation models and provides insights for future research in this area.
In addition to quantitative analysis, our qualitative evaluation of the segmentation results highlights the precision and robustness of ASH-Net in capturing object boundaries beyond occlusion. This further validates the efficacy of our proposed method and demonstrates its potential for real-world applications.
While our research has achieved promising results, there are limitations and opportunities for future enhancements. Further improvements can be made by exploring new loss functions, refining the architecture, and investigating the generalizability of ASH-Net across different datasets and domains. Additionally, the impact of our research extends to inspiring future studies in the field of amodal segmentation, encouraging researchers to develop more accurate and reliable algorithms for estimating object boundaries in occluded scenes.
In conclusion, our research presents a significant step forward in the field of amodal segmentation, introducing ASH-Net as a powerful and effective architecture. The exceptional performance, optimal training parameters, and potential applications of our proposed method contribute to the advancement of amodal segmentation research. We anticipate that our findings will inspire further research and development in this area, ultimately leading to improved object recognition, scene understanding, and real-world applications.