Introduction
Visual surveillance is an essential technique for safety applications and involves two key steps: foreground separation followed by tracking. For an effective surveillance system, local change detection is the primary task. Over the last several decades, local change detection in challenging video scenes has remained an arduous problem and an active research area in visual surveillance. Foreground segmentation from image frames has numerous applications: activity recognition [1], traffic supervision [2], industrial monitoring [3], underwater surveillance [4], etc. The moving object detection process separates the moving objects from the background in a sequence of complex video scenes. The procedure can therefore be considered a binary classification task in which the pixels corresponding to the background are eliminated, while the pixels belonging to moving objects are retained. Foreground separation from complex video scenes is challenging due to dynamic backgrounds, camera jitter, missing information, slowly moving objects, etc. The background subtraction (BGS) approach [5] is the most prominent way to partition the foreground from the background: the moving objects of each image frame are separated from the modeled background. Over the last few decades, researchers across the globe have developed numerous BGS techniques. However, these existing techniques perform well only for specific challenges, and their effectiveness depends on manual parameter tuning and handcrafted features. This raises the need for more efficient and resilient techniques for detecting moving objects. Deep learning frameworks have been instrumental in advancing computer vision applications over the years. Deep neural networks are now widely used for moving object detection as well, since they can retain low-, mid-, and high-level features [6], [7], [8]. Further, the efficiency of deep neural networks can be enhanced by utilizing a transfer learning strategy.
Several drawbacks have been identified in deep neural network architectures for local change detection. Incorporating deep learning frameworks in visual surveillance increases the complexity of the system: as the depth of the layers increases, the complexity of the model escalates. Furthermore, training a deep neural network requires a large number of sample frames. Additionally, end-to-end models for local change detection are rare among existing techniques.
Therefore, a deep learning architecture in the form of an encoder-decoder model that effectively addresses multiple challenges encountered in complex as well as slow-moving video scenes is developed. An improved version of the pre-trained VGG-19 deep learning framework is adopted as the encoder in the proposed methodology. The weights of the initial two blocks are set to the pre-trained weights, while the weights of the third block are trained specifically on challenging datasets, enhancing the model’s resilience. With this transfer learning strategy, the proposed VGG-19 deep neural network preserves the features appropriate for moving object detection. Subsequently, the feature maps obtained from the encoder are fed into the feature pooling framework, where features are pooled across different scales along the depth dimension. This is accomplished through a max-pooling layer, a convolution layer, and multiple convolutional layers with distinct sampling rates. The decoder network in the proposed scheme then effectively projects the feature labels to pixel labels.
Therefore, the MOD-CVS contributes in five main ways:
This work makes a first and unique attempt at detecting local changes in challenging video datasets comprising moderate-, fast-, and slowly-moving objects, using a feature pooling framework with an improved VGG-19-based encoder-decoder architecture.
The proposed algorithm provides better accuracy on four datasets with diverse challenges, including slowly-moving objects, moderately and fast-moving objects, indoor and outdoor image sequences, dynamic backgrounds, camera jitter, night videos, low frame rates, thermal imagery, etc.
The proposed model uses fewer samples for training and attains better accuracy than current SOTA approaches without extracting temporal information from the challenging video scenes.
Incorporating a transfer learning mechanism in the suggested scheme enables the model to learn the weights efficiently and enhances its efficiency.
A selected number of blocks in the proposed VGG-19 architecture is used, making the model less complex than existing deep neural networks.
The remainder of the article is structured as follows. Section II reviews the literature on local change detection. Section III discusses the proposed model in depth with a graphical illustration. Section IV presents the analysis of empirical outcomes and the ablation study. Section V provides the article’s conclusions along with a glimpse into future work.
State-of-the-Art Techniques
The detection of local changes using the background subtraction technique is one of the most widely investigated topics in computer vision. For decades, researchers have worked to build robust background subtraction algorithms that can detect objects in motion in complex scenes, including objects moving at a relatively low speed (where the object motion is confined to a smaller region), objects moving at a relatively high speed (where subsequent frames show higher variation), variations in illumination, camera jitter, shadows, images captured at night, low frame rates, low contrast, low resolution, non-static backgrounds, etc. Taking into account the latest literature, the SOTA techniques are divided into two parts as follows:
A. SOTA Techniques for Slow Moving Object Detection
Slowly moving object detection deals with identifying and tracking objects that move at a relatively low speed, so that subsequent frames show less variation. In most cases, the spatial motion of the object is confined to a small area. While there are various techniques and approaches to object detection, detecting slowly moving objects presents specific challenges due to their reduced motion and potentially smaller visual cues. The most commonly used techniques for slowly moving object detection are frame differencing (FD) [14], optical flow (OF) [15], background subtraction (BGS) [16], feature-based (FB) methods [17], [18], and machine learning-based (ML) approaches [19]. The choice of method depends on the specific application, the characteristics of the slowly-moving objects, and the available computational resources. A combination or adaptation of multiple techniques may also be required to achieve accurate extraction of slowly-moving objects followed by tracking in various scenarios [20]. BGS is an effective technique for detecting fast and moderately moving objects in a scene. It provides accurate segmentation of the foreground in real time with low computational cost when the object moves slowly against a relatively static background. However, this method is sensitive to lighting changes, limited to static backgrounds, struggles to handle occlusion, and requires background modeling if the background is not available. Moreover, it fails to extract slow-moving objects due to the limited spatial change of pixels in the object area [21], [22], [23]. Moving object detection using the OF method uses the motion vectors of pixels to determine the direction and magnitude of movement. It is very accurate at identifying and following fast-moving objects. The method is highly adaptable to variations in texture, lighting, and other factors, making it suitable for tracking objects in real time in video surveillance applications. Nevertheless, it cannot effectively handle occlusion and is sensitive to image noise. It does not provide depth information about the object being tracked, and it may not work well for objects that are stationary or moving slowly, since it relies on the movement of objects [15], [24], [25]. FD is a common method for detecting moving objects in a video sequence. It is a fast method that can operate in real time, making it suitable for surveillance systems. It provides a cost-effective solution and can even detect objects that are partially occluded by comparing changes between frames. However, it is sensitive to noise and small changes, such as camera shake or changes in lighting, which produce false positives and affect the accuracy of the result. It only detects moving objects that differ from the background, making it unsuitable for detecting objects with a color or texture similar to the background [26], [27]. The FB methods can handle challenging scenarios where the appearance of the object varies due to different lighting or complex backgrounds. These methods can reduce the computational burden and process video in real time by extracting specific image features. On the other hand, the performance of these methods degrades significantly if the features are not robustly detected or are affected by noise or occlusion. They also often require fine-tuning or retraining when dealing with new object classes or motion characteristics.
These methods primarily pay attention to low-level image features like edges, corners, or texture patterns, without explicitly incorporating high-level semantic information. As a result, they may not be able to differentiate between objects with similar low-level features, leading to errors in object detection or tracking [28], [29], [30].
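As a concrete illustration of the frame differencing and background subtraction ideas surveyed above, the following is a minimal sketch using OpenCV. The video path, the intensity threshold of 25, and the background update rate of 0.05 are illustrative assumptions, not values taken from the cited works.

```python
# Minimal sketch of classical frame differencing and running-average background
# subtraction. The input path and the numeric parameters are illustrative.
import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance.mp4")   # hypothetical input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
background = prev_gray.astype(np.float32)    # running-average background model

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Frame differencing: change with respect to the previous frame.
    fd_mask = cv2.absdiff(gray, prev_gray) > 25

    # Background subtraction: change with respect to a slowly updated background.
    bgs_mask = cv2.absdiff(gray.astype(np.float32), background) > 25
    cv2.accumulateWeighted(gray.astype(np.float32), background, 0.05)

    foreground = (fd_mask | bgs_mask).astype(np.uint8) * 255  # binary change mask
    prev_gray = gray

cap.release()
```

Both masks illustrate why these classical cues struggle with slowly moving objects: when inter-frame change is small, few pixels exceed the fixed threshold.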
B. SOTA Techniques for Moderately and Fast Moving Object Detection
Moderately and fast moving object detection deals with the process of identifying and tracking objects that are moving at a relatively high speed where the subsequent frames have higher variation.
Several SOTA ML and deep-learning-based approaches for moderate and fast-moving object detection have been discussed in the literature. The Single Shot MultiBox Detector (SSD), introduced by Liu et al. [31], demonstrates efficient object detection in images by achieving a favorable balance between accuracy and speed. SSD applies non-maximum suppression (NMS) to filter out redundant bounding box predictions and produce the final set of object detections. The prime advantages of SSD are its simplicity, speed, and ability to detect objects at multiple scales. However, SSD sacrifices some accuracy for faster inference. It utilizes a predetermined set of anchor boxes to detect objects at various scales, and choosing the right scales and aspect ratios for these anchor boxes can be challenging; objects that significantly deviate from the predefined anchor boxes may not be accurately detected. It also cannot handle highly occluded objects. The Faster R-CNN framework for object detection introduced by Ren et al. [32] became a popular and influential method in computer vision. This technique efficiently and accurately localizes objects with high precision, using a region proposal network (RPN) to generate candidate object proposals, and allows the entire object detection system to be trained end-to-end towards optimizing the overall performance. However, it is more complex than previous object detection methods: it involves multiple components, including a region proposal network, a shared convolutional backbone, and an object-specific classifier, which makes it more challenging to understand and implement. Lin et al. [33] introduced a novel loss function called Focal Loss, specifically designed for dense prediction tasks like object detection and instance segmentation. The Focal Loss addresses the issue of class imbalance and the overwhelming number of easy negative examples that can hinder the training of object detectors. Focal Loss introduces an additional hyper-parameter, called the focusing parameter, which controls the rate at which the loss is down-weighted for easy negatives. Choosing an appropriate value for this parameter requires careful tuning, and an improper setting can affect the performance of the model. Zhou et al. [34] propose a method for object detection called “Objects as Points”, which aims at efficient real-time object detection. This method demonstrates impressive real-time performance, enabling fast object detection in videos and live-streaming applications. While the method achieves high accuracy in detecting objects, its localization accuracy may not be as precise as that of some other object detection methods that rely on bounding boxes. This limitation might affect tasks that require precise localization, such as object tracking or fine-grained object recognition. Law and Deng present an object detection framework called CornerNet, which detects objects by treating them as paired keypoints. CornerNet represents objects as keypoints and models the object’s spatial information, which helps in precise localization and reduces false positives. Because this approach treats objects as keypoints, it may struggle with objects that have complex or highly variable poses, and since the model primarily focuses on detecting corners, it may not be as effective in cases where keypoints are not prominent or informative [35].
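For reference, the focal loss of Lin et al. [33] discussed above down-weights well-classified examples through a modulating factor (1 - p_t)^gamma. The sketch below implements the binary form in TensorFlow; the values alpha = 0.25 and gamma = 2.0 are the commonly reported defaults and are assumptions here, not settings taken from the surveyed works.

```python
# A minimal sketch of the binary focal loss of Lin et al. [33].
import tensorflow as tf

def binary_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    # p_t is the predicted probability assigned to the true class.
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    # The (1 - p_t)^gamma factor down-weights easy (well-classified) examples.
    loss = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
    return tf.reduce_mean(loss)
```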
Zhu et al. [36] proposed Generative Adversarial Networks (GANs), a popular class of deep learning models used in generative modeling tasks such as image synthesis and data generation. GANs can produce images, audio, and text in a realistic manner that closely resembles the training set, which makes them valuable in various applications, including art generation, data augmentation, and synthetic data creation for training other models. These networks are prone to a phenomenon known as mode collapse, wherein the generator produces a restricted range of samples and fails to capture the full distribution of the data; the generated samples then lack diversity and do not cover all the modes present in the training data. MotionRec [37] is composed of a temporal depth reductionist (TDR) block, a motion saliency estimation (MoSENet) network, and regression and classification blocks. It represents the first attempt to concurrently localize and classify dynamic entities within a video, referred to as moving object recognition (MOR), using a unified single-stage deep learning framework on the CDNet-2014 dataset. Unified frameworks, however, may be computationally expensive. AE-NE [38] is entirely unsupervised. It operates with a fixed set of hyperparameters, and the architecture of the autoencoder is dynamically determined based on image size and background complexity, without manual supervision. The autoencoder is additionally trained to predict background noise, enabling the calculation of a pixel-dependent threshold for foreground segmentation in each frame. This model is ill-suited for processing night videos, as indicated by the low score it achieved in this category on the CDNet-2014 dataset, and it is not recommended for scenarios where the video is anticipated to depict substantial stationary objects over an extended duration. TSS [39] has made significant contributions to computer vision and video analysis. This method can learn hierarchical features from data, enabling it to discern patterns and variations in motion more effectively than traditional methods. However, it may be highly specialized and may not generalize well across different domains or environmental conditions; fine-tuning or retraining might be necessary for optimal performance in diverse settings. A real-time multiple object tracking method [40] is based on a modified version of the deep simple online and real-time tracking (Deep SORT) algorithm. Deep learning methods can handle a large number of objects and complex scenes simultaneously, making them suitable for tracking multiple objects in crowded environments; however, training such models for multiple object tracking requires large annotated datasets, which can be time-consuming and expensive to create, particularly for diverse scenarios. Jiawei et al. [41] proposed a 3D video object detection framework emphasizing long-term temporal visual correlation, termed BA-Det. BA-Det operates as a two-stage object detector, proficient in concurrently learning object detection and temporal feature correspondence through the introduced feature-metric object bundle adjustment (OBA) loss. The method exclusively concentrates on objects such as cars, trucks, and trailers; its effectiveness on flexible objects like pedestrians has not been explored.
Further, related works on background subtraction techniques using explainable deep learning frameworks are outlined in Table 1, emphasizing their principal contributions, advantages, and disadvantages.
It is found that the SOTA methods discussed above for slow-moving object detection are capable of identifying objects when the variation among consecutive frames is small, whereas the SOTA schemes for moderately and fast-moving object detection detect objects when there is a higher variation among successive frames. Hence, it may be concluded that a single method cannot detect all types of moving objects at various speeds. This motivated us to develop a moving object detection framework using a VGG-19 architecture with a structural-modification-induced FPF to detect moving objects at slow, moderate, and fast speeds. In the proposed design, the improved VGG-19 architecture can retain details at various levels, the FPF module induced by the proposed VGG-19 architecture is capable of preserving the details of objects at several scales and speeds, and the designed decoder architecture can effectively project features to image space. Section III of the paper focuses on the proposed methodology.
The Proposed Algorithm
This article presents a unique and robust deep-learning model for foreground segmentation from complex video scenes under various challenging scenarios. We have developed a deep learning model in which a modified VGG-19 network is used as an encoder integrated with a feature pooling framework (FPF) to effectively detect objects of diverse sizes in video scenes. The FPF block can retain the sparse and dense features from image frames that are suitable for local change detection. The decoder network effectively learns a mapping from feature labels to pixel labels. Fig. 1 presents the developed network with the dimensions of each layer of the feature map in detail.
A. Encoder Network
In this work, the pre-trained VGG-19 network is improved and adopted as the encoder network. A typical VGG-19 network is used for several image-processing applications; nonetheless, this framework has not yet been explored for foreground segmentation. Here, we exploit the abilities of the VGG-19 network for foreground separation. The original VGG-19 network [46] has five blocks, each with stacked convolutional layers, and the activation function is the rectified linear unit (ReLU). Convolutional layers can retain the input image’s spatial information, and the ReLU function in the proposed model activates the required neurons, boosting the efficiency of the architecture.
The proposed model is equipped with an altered form of the deep VGG-19 network comprising the first three blocks, where the weights of the first two blocks are the same as the weights of the original VGG-19 architecture [46] and the weights of the third block are obtained using a transfer-learning (T-L) strategy on the challenging datasets. T-L, as a mechanism, transfers knowledge learned in one domain to another. In the developed technique, the T-L strategy allows new tasks to be learned upon the foundation of tasks previously learned by the original deep VGG-19 network. The T-L strategy also enhances the model’s speed and robustness, particularly when training on a limited number of samples. To optimize the utilization of high spatial resolution and frequency details, the fourth and fifth blocks of the original VGG-19 network have been omitted in the MOD-CVS. A detailed description of the altered VGG-19 deep learning model with the dimensions of each layer of the feature map is presented in Fig. 2. The high spatial frequency features are retained at the first block of the encoder by using
Detailed description of the altered VGG-19 deep learning model in the block diagram.
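As a minimal sketch of the encoder just described, the following Keras code reuses the first three blocks of an ImageNet-pre-trained VGG-19, freezing blocks 1 and 2 and leaving block 3 trainable for fine-tuning on the surveillance data. The layer names follow tf.keras.applications.VGG19, and the 240x320 input resolution is an illustrative assumption rather than the exact setting of Fig. 2.

```python
# Minimal sketch of the modified VGG-19 encoder: blocks 1-2 frozen (pre-trained
# ImageNet weights), block 3 trainable; blocks 4 and 5 are omitted.
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG19

def build_encoder(input_shape=(240, 320, 3)):   # resolution is an assumption
    vgg = VGG19(include_top=False, weights="imagenet", input_shape=input_shape)
    # Keep only blocks 1-3 of the original network.
    encoder = Model(inputs=vgg.input,
                    outputs=vgg.get_layer("block3_pool").output,
                    name="vgg19_encoder")
    for layer in encoder.layers:
        # Transfer learning: only block 3 is fine-tuned on the target datasets.
        layer.trainable = layer.name.startswith("block3")
    return encoder

encoder = build_encoder()
```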
B. Feature Pooling Framework
To effectively preserve objects of different scales from challenging video scenes, this work introduces a feature pooling framework (FPF) between the encoder and decoder networks, as shown in Fig. 3. The dimensions of each layer of the feature map of the FPF module are also shown in Fig. 3. The max-pooling layer is hybridized in the FPF module with 64,
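Although the exact configuration of the FPF is given in Fig. 3, a minimal sketch in the spirit described above is shown below: parallel convolutions with distinct dilation (sampling) rates, a 1x1 convolution, and a max-pooling branch, concatenated along the depth dimension. The dilation rates (2, 4, 8) and the uniform 64-filter width are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a feature pooling framework (FPF) block: multi-rate dilated
# convolutions plus a pooled branch, concatenated along the channel axis.
from tensorflow.keras import layers

def feature_pooling_framework(x, filters=64, rates=(2, 4, 8)):  # assumed values
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for rate in rates:
        # Convolutions with distinct sampling (dilation) rates capture
        # object details at several scales.
        branches.append(layers.Conv2D(filters, 3, dilation_rate=rate,
                                      padding="same", activation="relu")(x))
    pooled = layers.MaxPooling2D(pool_size=2, strides=1, padding="same")(x)
    branches.append(layers.Conv2D(filters, 1, padding="same",
                                  activation="relu")(pooled))
    # Pool the multi-scale features along the depth dimension.
    return layers.Concatenate(axis=-1)(branches)
```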
C. Decoder Network
Spatial information of the complex video scene is essential for effective moving object detection. Therefore, the decoder network in the proposed model comprises a stack of convolutional layers that preserve spatial information efficiently. The initial convolutional layer consists of 64 filters with a
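A minimal sketch of a decoder of this kind is given below: a stack of convolutional layers interleaved with upsampling that projects the pooled features back to a per-pixel foreground probability map. The number of stages and the filter widths are illustrative assumptions rather than the exact design of Fig. 1.

```python
# Sketch of a decoder that projects feature space back to image space,
# ending in a one-channel sigmoid foreground/background map.
from tensorflow.keras import layers

def build_decoder(features):
    x = features
    for filters in (64, 32, 16):             # assumed filter schedule
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(size=2)(x)   # recover spatial resolution
    # Per-pixel foreground probability.
    return layers.Conv2D(1, 1, padding="same", activation="sigmoid")(x)
```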
Analysis of Simulation-Based Experimental Results
The developed model runs on a Windows 10 operating system with 8 GB RAM and is implemented in Python. The proposed work is trained and tested on an NVIDIA Tesla T4 GPU provided by the Google Colaboratory Pro version, using the Keras library with the TensorFlow backend. The significance of the presented model is tested on the challenging datasets [9], [10], [11], [13], [12]. The efficiency of the developed algorithm is corroborated by comparing its results with the outcomes obtained by thirty-six SOTA techniques using subjective and objective analysis.
A. Parameter Settings and Training Details
An NVIDIA Tesla T4 GPU system with a batch size of 2 is used to train the model end-to-end. The reduced batch size has a regularization effect and helps the model converge more quickly. Each frame contains P pixels, and N = 25 frames are used to train the model. Furthermore, we train the model using the binary cross-entropy loss (BCEL) function, which compares each pixel’s actual and predicted class labels.
To train the proposed approach, we used the RMSProp optimizer with
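Since the exact optimizer settings are only partially reproduced above, the following sketch shows how the described training configuration (RMSProp optimizer, per-pixel binary cross-entropy, batch size of 2) could be assembled from the encoder, FPF, and decoder sketches given earlier. The learning rate (1e-4) and the epoch count are assumptions, not the paper's reported values.

```python
# Sketch of the training setup, built on the encoder / FPF / decoder sketches above.
import tensorflow as tf

features = feature_pooling_framework(encoder.output)
model = tf.keras.Model(inputs=encoder.input, outputs=build_decoder(features))

# Per-pixel binary cross-entropy with RMSProp; learning rate 1e-4 is assumed.
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

# frames: (N, 240, 320, 3) training images; masks: (N, 240, 320, 1) ground truths.
# model.fit(frames, masks, batch_size=2, epochs=50)   # epoch count is assumed
```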
B. Subjective Analysis
For slow-moving objects, the visual demonstration of the detection results achieved by the existing techniques and the developed algorithm is presented in Fig. 4. Fig. 4 (a) and (b) depict the original frames and associated ground-truth images, respectively. The results obtained by the Badri et al. [47] technique are presented in Fig. 4 (c), where the said technique detected background pixels as foreground pixels for various slow-moving image sequences. Fig. 4 (d) represents the detected outcomes obtained by the Zhu et al. [48] scheme, where the missed alarm rate is high. Fig. 4 (e) and Fig. 4 (f) denote the outcomes achieved by the Sahoo et al. [20] and Sahoo et al. [49] techniques, respectively, where a high false negative rate is observed. The outcomes attained by the developed model are illustrated in Fig. 4 (g), where the background and foreground pixels are classified accurately. Fig. 5 (a) and Fig. 5 (b) show the input images and the associated ground-truth frames. From Fig. 5 (g), it is evident that the developed technique accurately captures the moving object shape, demonstrating lower false negative and false positive rates than the Badri et al. [47], Zhu et al. [48], Sahoo et al. [20], and Sahoo et al. [49] techniques presented in Fig. 5 (c), (d), (e), and (f), respectively.
The change detection output is visually analyzed using seven sequences chosen from the CD-Net 2014 dataset. The challenging effects in these video scenes include low contrast, non-static backgrounds, low frame rates, noise, shadows, poor resolution, low signal-to-noise ratio, and lack of object shape and textural details in the images. The developed technique’s performance is visually compared with that of six established deep learning methods: BSUV-Net + SemanticBGS [50], BSUV-Net 2.0 [51], Cascaded CNN [52], DeepBS [53], Fast BSUV-Net 2.0 [51], and WisenetMD [54]. Fig. 6 (a) and (b) represent input images and their associated ground-truth frames, respectively. The object detection outcomes achieved by BSUV-Net + SemanticBGS are demonstrated in Fig. 6 (c), where it can be seen that the background is identified as the foreground. Fig. 6 (d) represents the outcomes of the BSUV-Net 2.0 [51] technique, where numerous false alarms are present in the target scene. The segmented outcome of the Cascaded CNN [52] method is showcased in Fig. 6 (e), where the said technique is unable to detect some details of the object in motion. Fig. 6 (f) shows outcomes of the DeepBS [53] method, where numerous edge pixels are absent due to imbalanced pixel values across various video frames, leading to a significant number of missed alarms in the detected outcomes. The outcomes of the Fast BSUV-Net 2.0 [51] technique are represented in Fig. 6 (g), where the mentioned technique incorrectly categorizes certain object pixels as background. Fig. 6 (h) represents the WisenetMD [54] algorithm’s results, where this method has difficulty discerning subtle variations in grey values, resulting in the generation of ghost artifacts. In contrast, the MOD-CVS, showcased in Fig. 6 (i), performs better than the existing SOTA techniques and precisely classifies background and foreground pixels. In complex video scenes, the developed technique can successfully determine the shapes of moving objects. The MOD-CVS is further tested on the Wallflower dataset, as shown in Fig. 7. The original frames and associated ground-truth images are represented in Fig. 7 (a) and (b), and Fig. 7 (c) illustrates the results attained by the developed technique. From Fig. 7 (c), it is found that the developed technique attains better results for the Wallflower dataset. Again, the MOD-CVS is validated on the Star dataset. The input frames and their corresponding ground-truth images are presented in Fig. 8 (a) and (b). Fig. 8 (c) illustrates the proposed method’s results, where it is noted that the MOD-CVS is capable of accurately classifying the foreground as well as background pixels with less noise.
Foreground segmentation for various sequences: (a) original frame, (b) ground-truth image, outcomes attained by the BGS techniques based on (c) BSUV-Net + SemanticBGS [50], (d) BSUV-Net 2.0 [51], (e) Cascaded CNN [52], (f) DeepBS [53], (g) Fast BSUV-Net 2.0 [51], (h) WisenetMD [54], and (i) MOD-CVS.
Foreground segmentation for various sequences: (a) original frame, (b) ground-truth image, (c) outcomes attained by MOD-CVS for the Wallflower dataset.
Foreground segmentation for various sequences: (a) original frame (b) ground-truth image, (c) outcomes attained by MOD-CVS for Star dataset.
C. Objective Analysis
To assess the efficacy of the developed technique, a quantitative comparison between the developed technique and the prevailing SOTA techniques for slow-moving objects, in terms of average F-measure (AF) and average misclassification error (AMCE), is outlined in Table 2 and Table 3. From these tables, it is found that the developed algorithm attains a higher AF with a lower AMCE than the Badri et al. [47], Zhu et al. [48], Sahoo et al. [20], and Sahoo et al. [49] methods.
To further justify the efficiency of the proposed algorithm, the developed model is tested on the CD-Net 2014 dataset with various challenging sequences, using the averages of Precision (AP), Recall (AR), F-measure (AF), and Percentage of Wrong Classification (APWC). The objective is to simultaneously reduce the percentage of wrong classification (PWC) and increase the F-measure, Precision, and Recall [59]. We compared the results obtained on the CD-Net 2014 dataset against eighteen existing BGS SOTA methods, including eight deep learning techniques: DeepBS [53], WisenetMD [54], Fast BSUV-Net 2.0 [51], SemanticBGS [60], BSUV-Net [50], BSUV-Net + SemanticBGS [50], IUTIS-5 [61], and BMN-BSN [62]. Table 4 shows that the proposed model achieves superior values of AP, AR, and AF while exhibiting a lower APWC compared to all the SOTA deep learning techniques. The MOD-CVS is also compared with ten non-deep-learning techniques: SWCD [63], CVABS [64], PAWCS [65], WiSARDrp [66], Multimode Background [67], BMOG [68], WeSAMBE [69], RT-SBS-v1 [70], M4CD Version 2.0 [71], and CL-VID [72]. Table 4 shows that the developed technique attains higher values of AP, AR, and AF, while also presenting a lower APWC, compared to the SOTA techniques that are not based on deep learning.
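For reference, the evaluation measures used above (Precision, Recall, F-measure, and PWC) can be computed from a predicted binary mask and its ground truth as in the following sketch; the small epsilon guarding against division by zero is an implementation convenience, not part of the metric definitions.

```python
# Per-frame change-detection metrics from binary masks.
import numpy as np

def change_detection_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)        # foreground correctly detected
    fp = np.sum(pred & ~gt)       # background wrongly marked as foreground
    fn = np.sum(~pred & gt)       # missed foreground
    tn = np.sum(~pred & ~gt)      # background correctly rejected
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f_measure = 2 * precision * recall / (precision + recall + 1e-12)
    pwc = 100.0 * (fp + fn) / (tp + fp + fn + tn)   # percentage of wrong classification
    return precision, recall, f_measure, pwc
```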
Further, to check the efficiency of the MOD-CVS, experiments have been conducted on the Star dataset, which consists of image sequences of challenging video scenes: noise in the video scene, non-static backgrounds, changes in lighting conditions, and shadow. We compared against five SOTA techniques: GMM [73], DPGMM [74], Feature bags [75], Video plane [13], and Self-organizing [76]. We employed the average similarity measure [76] to assess the effectiveness of the developed technique. The average similarity measure attained by the proposed approach compared to the different SOTA methods is shown in Table 5. The results in Table 5 indicate that the MOD-CVS exhibits a higher average similarity measure on the Star dataset than the other SOTA techniques considered.
Finally, to evaluate the efficacy of the developed model, the well-known Wallflower dataset is used for testing; it contains indoor and outdoor video scenes captured by a CCD camera with non-static backgrounds, illumination variations, and video noise. The effectiveness of the developed technique is validated through a comparative analysis with nine established SOTA techniques: Fuzzy Mode [77], ViBe [78], BRPCA [79], GMM [73], Codebook [80], DeepBS [53], Triplet CNN [81], MsEDNet [82], and STAM [83]. The evaluation metric employed for this database is AF. Analysis of Table 6 reveals that the proposed algorithm achieves the highest AF values compared to all the considered SOTA techniques.
D. Unseen Video Setup
In an unseen video arrangement, the training and testing sets contain different videos. The proposed framework is trained with the Claire, Mother daughter, and Grandma image sequences, while the Akiyo, Teleprompter, and Speech image frames are used for testing. Similarly, the model is trained using the Salesman, Teleprompter, and Speech image sequences, and the Miss and Suzie image frames are used for testing. From Table 7, it is observed that the designed model attains a better average F-measure for the unseen setup. Similarly, we have investigated the efficacy of the MOD-CVS in the unseen setup for the Wallflower and Star databases; Table 7 indicates that the developed model exhibits better AF values for the Wallflower and Star databases in the unseen configuration. Additionally, the effectiveness of the developed algorithm is assessed in unseen setups for the CD-Net 2014 dataset. As shown in Table 8, the proposed model demonstrates satisfactory accuracy compared to established BGS techniques. In this table, Bl, Pe, Sw, Bo, Pa, Tp, Ts, Bs, Co, and T1 denote blizzard (from BadWeather), pedestrian (from Baseline), sidewalk (from Camera Jitter), boats (from Dynamic Background), parking (from Intermittent Object Motion), turnpike05fps (from Low Framerate), tramstation (from Night Videos), busstation (from Shadow), corridor (from Thermal), and turbulence1 (from Turbulence), respectively.
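For clarity, the unseen-video protocol described above amounts to forming training and test sets from disjoint sequences. The sketch below illustrates one such split; load_sequence is a hypothetical loader returning (frames, masks) arrays and is not part of the original work.

```python
# Sketch of an unseen-video split: train and test sequences are disjoint.
import numpy as np

train_seqs = ["Claire", "Mother daughter", "Grandma"]
test_seqs = ["Akiyo", "Teleprompter", "Speech"]

def gather(seq_names):
    frames, masks = [], []
    for name in seq_names:
        f, m = load_sequence(name)   # hypothetical loader for (frames, masks)
        frames.append(f)
        masks.append(m)
    return np.concatenate(frames), np.concatenate(masks)

x_train, y_train = gather(train_seqs)
x_test, y_test = gather(test_seqs)
```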
E. Ablation Study
To analyze the importance of each element in the developed BGS deep-learning framework, an ablation study is performed. Table 9 demonstrates the efficacy of the developed algorithm with and without the GAP layer. It is found that the inclusion of the GAP layer in the proposed model consistently yields a higher AF value than the version without the GAP layer across all challenging videos. Likewise, an ablation study of the proposed architecture is conducted with and without the integration of the feature pooling framework (FPF). From Table 10, it is observed that the proposed algorithm with the FPF module attains higher accuracy than the version without the FPF module. The FPF module between the encoder and decoder effectively learns a mapping from a high-dimensional feature space to a multi-scale, multi-dimensional feature space.
Additionally, the ablation study culminated in a run-time comparison of the proposed approach against various SOTA techniques on the CDNet-2014 dataset. Table 11 reveals that the processing speed of the developed architecture is 21 frames per second, underscoring the comparatively lower computational complexity of the MOD-CVS with respect to many existing SOTA methods.
The proposed method is further tested with k-fold cross-validation. The performance of the MOD-CVS with k-fold cross-validation (k = 5 and k = 10) is outlined in Table 12. For k = 5, the entire set of 159,278 frames from the CDNet-2014 dataset is partitioned into 5 folds. Initially, the first fold is used for testing and the remaining folds for training; subsequently, the second fold serves as the testing set with the remaining folds employed for training, and this process continues iteratively. Similarly, for k = 10, the 159,278 frames of CDNet-2014 are divided into 10 folds, and testing is carried out with one fold while training is conducted with the remaining folds in a sequential manner. According to Table 12, the proposed approach with k-fold cross-validation achieves an average F-Measure of 0.8105 for k = 5 and 0.8157 for k = 10, whereas the developed MOD-CVS technique without the cross-validation training mechanism attains a higher average F-Measure of 0.8269.
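The following sketch illustrates the k-fold protocol described above using scikit-learn; build_model is a hypothetical factory returning a freshly compiled model, change_detection_metrics is the metric helper sketched earlier, and the per-fold epoch count is an illustrative assumption.

```python
# Sketch of k-fold cross-validation over a frame collection (k = 5 here).
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(frames, masks, build_model, k=5):
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=False).split(frames):
        model = build_model()                      # fresh model per fold (hypothetical factory)
        model.fit(frames[train_idx], masks[train_idx], batch_size=2, epochs=10)  # epochs assumed
        pred = model.predict(frames[test_idx]) > 0.5
        _, _, f_measure, _ = change_detection_metrics(pred, masks[test_idx])
        fold_scores.append(f_measure)
    return float(np.mean(fold_scores))             # average F-Measure across folds
```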
Further, the efficacy of the proposed MOD-CVS model is verified in Table 13, which compares the average F-Measure of the proposed MOD-CVS with that of various Swin Transformer-based methods. It is found that the proposed MOD-CVS method achieves a comparatively higher value than the existing methods. Furthermore, this consistency is demonstrated on the NJU2K [57], STERE [56], NLPR [58], and DUTS [55] datasets. Fig. 9 (a) and (b) depict the input frame and its corresponding ground-truth image, respectively. Fig. 9 (c) portrays the outcomes of the proposed method, emphasizing the MOD-CVS’s proficiency in precisely categorizing both foreground and background pixels.
Table 14 compares the average F-Measure of the autoencoder-based AE-NE [38] approach and the proposed MOD-CVS method on the CDNet-2014 dataset. It clearly demonstrates that the MOD-CVS exhibits a higher F-measure value than the alternative method.
Conclusion
This research work tackles the task of detecting moving objects in challenging video scenes by employing a deep-learning architecture with an encoder-decoder design. The proposed model detects moving objects in complex video scenes, including objects moving at different speeds, low contrast, non-static backgrounds, low frame rates, noise, images captured at night, shadows, poor resolution, low signal-to-noise ratio, and lack of object shape and textural details in the images. To extract diverse features accurately at multiple levels, we have used an improved version of the pre-trained VGG-19 deep learning network as the encoder. The transfer learning mechanism in the encoder network further enhances the efficacy of the MOD-CVS model, and the various layers in the proposed VGG-19 deep neural network are capable of preserving the low-, mid-, and high-level features that are essential for local change detection. The feature pooling framework (FPF) between the encoder and decoder networks efficiently preserves objects of various scales in challenging video frames. In the proposed algorithm, the FPF module effectively learns a mapping from a higher-dimensional feature space to a multi-scale, multi-dimensional feature space that allows the foreground and background pixels to be classified with simple decision boundaries. The decoder network in the MOD-CVS model contains a stack of convolutional layers that effectively project the feature space to the image space. The effectiveness of the MOD-CVS algorithm is corroborated using subjective and objective analysis against thirty-six SOTA techniques. It is observed that the MOD-CVS model retains the shape of the moving object accurately with fewer pores and holes compared to the SOTA techniques. The MOD-CVS also provides adequate accuracy for unseen video setups. However, the performance of the MOD-CVS is reduced when the moving object size is small, and the proposed work provides frontier outcomes when there is a higher variation in the scene. In the future, we aim to improve the accuracy of the MOD-CVS by investigating a robust hybridized deep neural architecture.