Introduction
Visual surveillance is an essential technique for safety applications and involves two key steps: foreground separation followed by tracking. For an effective surveillance system, local change detection is the primary task. Over the last several decades, local change detection in challenging video scenes has remained an arduous problem and an active research area in visual surveillance. Foreground segmentation from image frames has numerous applications: activity recognition [1], traffic supervision [2], industrial monitoring [3], underwater surveillance [4], etc. The moving object detection process separates the moving objects from the background in a sequence of complex video scenes. The procedure can therefore be considered a binary classification task in which the pixels corresponding to the background are eliminated, while the pixels belonging to moving objects are retained. Foreground separation from complex video scenes is challenging due to dynamic backgrounds, camera jitter, missing information, slowly moving objects, etc. The background subtraction (BGS) approach [5] is the most prominent way to partition the foreground from the background: the moving objects of each image frame are separated from the modeled background. Over the last few decades, researchers across the globe have developed numerous BGS techniques. However, these existing techniques perform well only for specific challenges, and their effectiveness depends on manual parameter tuning and handcrafted features. This raises the need for more efficient and resilient techniques for detecting moving objects. Deep learning frameworks have been instrumental in advancing computer vision applications over the years. Deep neural networks are now widely used for moving object detection as well, since they can retain low-, mid-, and high-level features [6], [7], [8]. Further, the efficiency of deep neural networks can be enhanced by utilizing a transfer learning strategy.
Several drawbacks have been identified in deep neural network architectures for local change detection. Incorporating deep learning frameworks in visual surveillance increases the complexity of the system: as the depth of the layers increases, the complexity of the model escalates. Furthermore, training a deep neural network requires a large number of sample frames. Additionally, end-to-end models for local change detection are rare among existing techniques.
Therefore, a deep learning architecture in the form of an encoder-decoder model that effectively addresses multiple challenges encountered in complex as well as slow-moving video scenes is developed. An improved version of the pre-trained VGG-19 deep learning framework is adopted as the encoder in the proposed methodology. The weights of the initial two blocks are set to the pre-trained weights, while the weights of the third block are trained specifically on challenging datasets, enhancing the model’s resilience. With this transfer learning strategy, the proposed VGG-19 deep neural network preserves the features appropriate for moving object detection. Subsequently, the feature maps obtained from the encoder are fed into the feature pooling framework, where features are pooled across different scales along the depth dimension. This is accomplished through a max-pooling layer, a convolution layer, and multiple convolutional layers with distinct sampling rates. The decoder network in the proposed scheme then effectively projects the feature labels to pixel labels.
Therefore, the MOD-CVS contributes in five main ways:
This work makes a first and unique attempt at detecting local changes in challenging video datasets comprising moderate-, fast-, and slowly-moving objects, using a feature pooling framework with an improved VGG-19-based encoder-decoder architecture.
The proposed algorithm provides better accuracy on four datasets with diverse challenges, including slowly-moving objects, moderately and fast-moving objects, indoor and outdoor image sequences, dynamic backgrounds, camera jitter, night videos, low frame rates, thermal imagery, etc.
The proposed model uses fewer samples for training and attains better accuracy than current SOTA approaches without extracting temporal information from the challenging video scenes.
Incorporating a transfer learning mechanism in the suggested scheme enables the model to learn the weights efficiently and enhances its efficiency.
A selected number of blocks in the proposed VGG-19 architecture is used, making the model less complex than existing deep neural networks.
The remainder of the article is structured as follows. Section II reviews the literature on local change detection. Section III discusses the proposed model in depth with a graphical illustration. Section IV presents the analysis of empirical outcomes and the ablation study. Section V provides the article’s conclusions along with a glimpse into future work.
State-of-the-Art Techniques
The detection of local changes using the background subtraction technique is one of the most widely investigated topics in computer vision. For decades, researchers have worked to build robust background subtraction algorithms that can detect objects in motion in complex scenes, including objects moving at a relatively low speed (where the object motion is confined to a smaller region), objects moving at a relatively high speed (where subsequent frames show higher variation), variations in illumination, camera jitter, shadows, images captured at night, low frame rates, low contrast, low resolution, non-static backgrounds, etc. Taking into account the latest literature, the SOTA techniques are divided into two parts as follows:
A. SOTA Techniques for Slow Moving Object Detection
Slowly moving object detection deals with identifying and tracking objects that move at a relatively low speed, so that subsequent frames show less variation. In most cases, the spatial motion of the object is confined to a small area. While there are various techniques and approaches to object detection, detecting slowly moving objects presents specific challenges due to their reduced motion and potentially smaller visual cues. The most commonly used techniques for slowly moving object detection are frame differencing (FD) [14], optical flow (OF) [15], background subtraction (BGS) [16], feature-based (FB) methods [17], [18], and machine learning-based (ML) approaches [19]. The choice of method depends on the specific application, the characteristics of the slowly-moving objects, and the available computational resources. A combination or adaptation of multiple techniques may also be required to achieve accurate extraction of slowly-moving objects followed by tracking in various scenarios [20]. BGS is an effective technique for detecting fast and moderately moving objects in a scene. It provides accurate segmentation of the foreground in real time with low computational cost when the object moves slowly against a relatively static background. However, this method is sensitive to lighting changes, limited to static backgrounds, struggles to handle occlusion, and requires background modeling if the background is not available. Moreover, it fails to extract slow-moving objects due to the limited spatial change of pixels in the object area [21], [22], [23]. Moving object detection using the OF method uses the motion vectors of pixels to determine the direction and magnitude of movement. It is very accurate at identifying and following fast-moving objects. The method is highly adaptable to variations in texture, lighting, and other factors, making it suitable for tracking objects in real time in video surveillance applications. Nevertheless, it cannot effectively handle occlusion and is sensitive to image noise. It does not provide depth information about the object being tracked, and it may not work well for objects that are stationary or moving slowly, since it relies on the movement of objects [15], [24], [25]. FD is a common method for detecting moving objects in a video sequence. It is a fast method that can operate in real time, making it suitable for surveillance systems. It provides a cost-effective solution and can even detect objects that are partially occluded by comparing changes between frames. However, it is sensitive to noise and small changes, such as camera shake or changes in lighting, which produce false positives and affect the accuracy of the result. It only detects moving objects that differ from the background, making it unsuitable for detecting objects with a color or texture similar to the background [26], [27]. The FB methods can handle challenging scenarios where the appearance of the object varies due to different lighting or complex backgrounds. These methods can reduce the computational burden and process video in real time by extracting specific image features. On the other hand, the performance of these methods degrades significantly if the features are not robustly detected or are affected by noise or occlusion. They also often require fine-tuning or retraining when dealing with new object classes or motion characteristics.
These methods primarily pay attention to low-level image features like edges, corners, or texture patterns, without explicitly incorporating high-level semantic information. As a result, they may not be able to differentiate between objects with similar low-level features, leading to errors in object detection or tracking [28], [29], [30].
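As a concrete illustration of the frame differencing and background subtraction ideas surveyed above, the following is a minimal sketch using OpenCV. The video path, the intensity threshold of 25, and the background update rate of 0.05 are illustrative assumptions, not values taken from the cited works.

```python
# Minimal sketch of classical frame differencing and running-average background
# subtraction. The input path and the numeric parameters are illustrative.
import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance.mp4")   # hypothetical input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
background = prev_gray.astype(np.float32)    # running-average background model

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Frame differencing: change with respect to the previous frame.
    fd_mask = cv2.absdiff(gray, prev_gray) > 25

    # Background subtraction: change with respect to a slowly updated background.
    bgs_mask = cv2.absdiff(gray.astype(np.float32), background) > 25
    cv2.accumulateWeighted(gray.astype(np.float32), background, 0.05)

    foreground = (fd_mask | bgs_mask).astype(np.uint8) * 255  # binary change mask
    prev_gray = gray

cap.release()
```

Both masks illustrate why these classical cues struggle with slowly moving objects: when inter-frame change is small, few pixels exceed the fixed threshold.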
B. SOTA Techniques for Moderately and Fast Moving Object Detection
Moderately and fast moving object detection deals with the process of identifying and tracking objects that are moving at a relatively high speed where the subsequent frames have higher variation.
Several SOTA ML and deep-learning-based approaches for moderate and fast-moving object detection have been discussed in the literature. The Single Shot MultiBox Detector (SSD), introduced by Liu et al. [31], demonstrates efficient object detection in images by achieving a favorable balance between accuracy and speed. SSD applies non-maximum suppression (NMS) to filter out redundant bounding box predictions and produce the final set of object detections. The prime advantages of SSD are its simplicity, speed, and ability to detect objects at multiple scales. However, SSD sacrifices some accuracy for faster inference. It utilizes a predetermined set of anchor boxes to detect objects at various scales, and choosing the right scales and aspect ratios for these anchor boxes can be challenging; objects that significantly deviate from the predefined anchor boxes may not be accurately detected. It also cannot handle highly occluded objects. The Faster R-CNN framework for object detection introduced by Ren et al. [32] became a popular and influential method in computer vision. This technique efficiently and accurately localizes objects with high precision, using a region proposal network (RPN) to generate candidate object proposals, and allows the entire object detection system to be trained end-to-end towards optimizing the overall performance. However, it is more complex than previous object detection methods: it involves multiple components, including a region proposal network, a shared convolutional backbone, and an object-specific classifier, which makes it more challenging to understand and implement. Lin et al. [33] introduced a novel loss function called Focal Loss, specifically designed for dense prediction tasks like object detection and instance segmentation. The Focal Loss addresses the issue of class imbalance and the overwhelming number of easy negative examples that can hinder the training of object detectors. Focal Loss introduces an additional hyper-parameter, called the focusing parameter, which controls the rate at which the loss is down-weighted for easy negatives. Choosing an appropriate value for this parameter requires careful tuning, and an improper setting can affect the performance of the model. Zhou et al. [34] propose a method for object detection called “Objects as Points”, which aims at efficient real-time object detection. This method demonstrates impressive real-time performance, enabling fast object detection in videos and live-streaming applications. While the method achieves high accuracy in detecting objects, its localization accuracy may not be as precise as that of some other object detection methods that rely on bounding boxes. This limitation might affect tasks that require precise localization, such as object tracking or fine-grained object recognition. Law and Deng present an object detection framework called CornerNet, which detects objects by treating them as paired keypoints. CornerNet represents objects as keypoints and models the object’s spatial information, which helps in precise localization and reduces false positives. Because this approach treats objects as keypoints, it may struggle with objects that have complex or highly variable poses, and since the model primarily focuses on detecting corners, it may not be as effective in cases where keypoints are not prominent or informative [35].
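For reference, the focal loss of Lin et al. [33] discussed above down-weights well-classified examples through a modulating factor (1 - p_t)^gamma. The sketch below implements the binary form in TensorFlow; the values alpha = 0.25 and gamma = 2.0 are the commonly reported defaults and are assumptions here, not settings taken from the surveyed works.

```python
# A minimal sketch of the binary focal loss of Lin et al. [33].
import tensorflow as tf

def binary_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    # p_t is the predicted probability assigned to the true class.
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    # The (1 - p_t)^gamma factor down-weights easy (well-classified) examples.
    loss = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
    return tf.reduce_mean(loss)
```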
Zhu et al. [36] proposed Generative Adversarial Networks (GANs), a popular class of deep learning models used in generative modeling tasks such as image synthesis and data generation. GANs can produce images, audio, and text in a realistic manner that closely resembles the training set, which makes them valuable in various applications, including art generation, data augmentation, and synthetic data creation for training other models. These networks are prone to a phenomenon known as mode collapse, wherein the generator produces a restricted range of samples and fails to capture the full distribution of the data; the generated samples then lack diversity and do not cover all the modes present in the training data. MotionRec [37] is composed of a temporal depth reductionist (TDR) block, a motion saliency estimation (MoSENet) network, and regression and classification blocks. It represents the first attempt to concurrently localize and classify dynamic entities within a video, referred to as moving object recognition (MOR), using a unified single-stage deep learning framework on the CDNet-2014 dataset. Unified frameworks, however, may be computationally expensive. AE-NE [38] is entirely unsupervised. It operates with a fixed set of hyperparameters, and the architecture of the autoencoder is dynamically determined based on image size and background complexity, without manual supervision. The autoencoder is additionally trained to predict background noise, enabling the calculation of a pixel-dependent threshold for foreground segmentation in each frame. This model is ill-suited for processing night videos, as indicated by the low score it achieved in this category on the CDNet-2014 dataset, and it is not recommended for scenarios where the video is anticipated to depict substantial stationary objects over an extended duration. TSS [39] has made significant contributions to computer vision and video analysis. This method can learn hierarchical features from data, enabling it to discern patterns and variations in motion more effectively than traditional methods. However, it may be highly specialized and may not generalize well across different domains or environmental conditions; fine-tuning or retraining might be necessary for optimal performance in diverse settings. A real-time multiple object tracking method [40] is based on a modified version of the deep simple online and real-time tracking (Deep SORT) algorithm. Deep learning methods can handle a large number of objects and complex scenes simultaneously, making them suitable for tracking multiple objects in crowded environments; however, training such models for multiple object tracking requires large annotated datasets, which can be time-consuming and expensive to create, particularly for diverse scenarios. Jiawei et al. [41] proposed a 3D video object detection framework emphasizing long-term temporal visual correlation, termed BA-Det. BA-Det operates as a two-stage object detector, proficient in concurrently learning object detection and temporal feature correspondence through the introduced feature-metric object bundle adjustment (OBA) loss. The method exclusively concentrates on objects such as cars, trucks, and trailers; its effectiveness on flexible objects like pedestrians has not been explored.
Further, related works on background subtraction techniques using explainable deep learning frameworks are outlined in Table 1, emphasizing their principal contributions, advantages, and disadvantages.
It is found that the SOTA methods discussed above for slow-moving object detection are capable of identifying objects when the variation among consecutive frames is small, whereas the SOTA schemes for moderately and fast-moving object detection detect objects when there is a higher variation among successive frames. Hence, it may be concluded that a single method cannot detect all types of moving objects at various speeds. This motivated us to develop a moving object detection framework using a VGG-19 architecture with a structural-modification-induced FPF to detect moving objects at slow, moderate, and fast speeds. In the proposed design, the improved VGG-19 architecture can retain details at various levels, the FPF module induced by the proposed VGG-19 architecture is capable of preserving the details of objects at several scales and speeds, and the designed decoder architecture can effectively project features to image space. Section III of the paper focuses on the proposed methodology.
The Proposed Algorithm
This article presents a unique and robust deep-learning model for foreground segmentation from complex video scenes under various challenging scenarios. We have developed a deep learning model in which a modified VGG-19 network is used as an encoder integrated with a feature pooling framework (FPF) to effectively detect objects of diverse sizes in video scenes. The FPF block can retain the sparse and dense features from image frames that are suitable for local change detection. The decoder network effectively learns a mapping from feature labels to pixel labels. Fig. 1 presents the developed network with the dimensions of each layer of the feature map in detail.
A. Encoder Network
In this work, the pre-trained VGG-19 network is improved and adopted as the encoder network. A typical VGG-19 network is used for several image-processing applications; nonetheless, this framework has not yet been explored for foreground segmentation. Here, we exploit the abilities of the VGG-19 network for foreground separation. The original VGG-19 network [46] has five blocks, each with stacked convolutional layers, and the activation function is the rectified linear unit (ReLU). Convolutional layers can retain the input image’s spatial information, and the ReLU function in the proposed model activates the required neurons, boosting the efficiency of the architecture.
The proposed model is equipped with an altered form of the deep VGG-19 network comprising the first three blocks, where the weights of the first two blocks are the same as the weights of the original VGG-19 architecture [46] and the weights of the third block are obtained using a transfer-learning (T-L) strategy on the challenging datasets. T-L, as a mechanism, transfers knowledge learned in one domain to another. In the developed technique, the T-L strategy allows new tasks to be learned upon the foundation of tasks previously learned by the original deep VGG-19 network. The T-L strategy also enhances the model’s speed and robustness, particularly when training on a limited number of samples. To optimize the utilization of high spatial resolution and frequency details, the fourth and fifth blocks of the original VGG-19 network have been omitted in the MOD-CVS. A detailed description of the altered VGG-19 deep learning model with the dimensions of each layer of the feature map is presented in Fig. 2. The high spatial frequency features are retained at the first block of the encoder by using
Detailed description of the altered VGG-19 deep learning model in the block diagram.
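As a minimal sketch of the encoder just described, the following Keras code reuses the first three blocks of an ImageNet-pre-trained VGG-19, freezing blocks 1 and 2 and leaving block 3 trainable for fine-tuning on the surveillance data. The layer names follow tf.keras.applications.VGG19, and the 240x320 input resolution is an illustrative assumption rather than the exact setting of Fig. 2.

```python
# Minimal sketch of the modified VGG-19 encoder: blocks 1-2 frozen (pre-trained
# ImageNet weights), block 3 trainable; blocks 4 and 5 are omitted.
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG19

def build_encoder(input_shape=(240, 320, 3)):   # resolution is an assumption
    vgg = VGG19(include_top=False, weights="imagenet", input_shape=input_shape)
    # Keep only blocks 1-3 of the original network.
    encoder = Model(inputs=vgg.input,
                    outputs=vgg.get_layer("block3_pool").output,
                    name="vgg19_encoder")
    for layer in encoder.layers:
        # Transfer learning: only block 3 is fine-tuned on the target datasets.
        layer.trainable = layer.name.startswith("block3")
    return encoder

encoder = build_encoder()
```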
B. Feature Pooling Framework
To effectively preserve objects of different scales from challenging video scenes, this work introduces a feature pooling framework (FPF) between the encoder and decoder networks, as shown in Fig. 3. The dimensions of each layer of the feature map of the FPF module are also shown in Fig. 3. The max-pooling layer is hybridized in the FPF module with 64,
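Although the exact configuration of the FPF is given in Fig. 3, a minimal sketch in the spirit described above is shown below: parallel convolutions with distinct dilation (sampling) rates, a 1x1 convolution, and a max-pooling branch, concatenated along the depth dimension. The dilation rates (2, 4, 8) and the uniform 64-filter width are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a feature pooling framework (FPF) block: multi-rate dilated
# convolutions plus a pooled branch, concatenated along the channel axis.
from tensorflow.keras import layers

def feature_pooling_framework(x, filters=64, rates=(2, 4, 8)):  # assumed values
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for rate in rates:
        # Convolutions with distinct sampling (dilation) rates capture
        # object details at several scales.
        branches.append(layers.Conv2D(filters, 3, dilation_rate=rate,
                                      padding="same", activation="relu")(x))
    pooled = layers.MaxPooling2D(pool_size=2, strides=1, padding="same")(x)
    branches.append(layers.Conv2D(filters, 1, padding="same",
                                  activation="relu")(pooled))
    # Pool the multi-scale features along the depth dimension.
    return layers.Concatenate(axis=-1)(branches)
```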
C. Decoder Network
Spatial information of the complex video scene is essential for effective moving object detection. Therefore, the decoder network in the proposed model comprises a stack of convolutional layers that preserve spatial information efficiently. The initial convolutional layer consists of 64 filters with a
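A minimal sketch of a decoder of this kind is given below: a stack of convolutional layers interleaved with upsampling that projects the pooled features back to a per-pixel foreground probability map. The number of stages and the filter widths are illustrative assumptions rather than the exact design of Fig. 1.

```python
# Sketch of a decoder that projects feature space back to image space,
# ending in a one-channel sigmoid foreground/background map.
from tensorflow.keras import layers

def build_decoder(features):
    x = features
    for filters in (64, 32, 16):             # assumed filter schedule
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(size=2)(x)   # recover spatial resolution
    # Per-pixel foreground probability.
    return layers.Conv2D(1, 1, padding="same", activation="sigmoid")(x)
```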
Analysis of Simulation-Based Experimental Results
The developed model runs on a Windows 10 operating system with 8 GB RAM and is implemented in Python. The proposed work is trained and tested on an NVIDIA Tesla T4 GPU provided by the Google Colaboratory Pro version, using the Keras library with the TensorFlow backend. The significance of the presented model is tested on the challenging datasets [9], [10], [11], [13], [12]. The efficiency of the developed algorithm is corroborated by comparing its results with the outcomes obtained by thirty-six SOTA techniques using subjective and objective analysis.
A. Parameter Settings and Training Details
An NVIDIA Tesla T4 GPU system with a batch size of 2 is used to train the model end-to-end. The reduced batch size has a regularization effect and helps the model converge more quickly. Each frame contains P pixels, and N = 25 frames are used to train the model. Furthermore, we train the model using the binary cross-entropy loss (BCEL) function, which compares each pixel’s actual and predicted class labels.
To train the proposed approach, we used the RMSProp optimizer with
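Since the exact optimizer settings are only partially reproduced above, the following sketch shows how the described training configuration (RMSProp optimizer, per-pixel binary cross-entropy, batch size of 2) could be assembled from the encoder, FPF, and decoder sketches given earlier. The learning rate (1e-4) and the epoch count are assumptions, not the paper's reported values.

```python
# Sketch of the training setup, built on the encoder / FPF / decoder sketches above.
import tensorflow as tf

features = feature_pooling_framework(encoder.output)
model = tf.keras.Model(inputs=encoder.input, outputs=build_decoder(features))

# Per-pixel binary cross-entropy with RMSProp; learning rate 1e-4 is assumed.
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

# frames: (N, 240, 320, 3) training images; masks: (N, 240, 320, 1) ground truths.
# model.fit(frames, masks, batch_size=2, epochs=50)   # epoch count is assumed
```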
B. Subjective Analysis
For slow-moving objects, the visual demonstration of the detection results achieved by the existing techniques and the developed algorithm is presented in Fig. 4. Fig. 4 (a) and (b) depict the original frames and associated ground-truth images, respectively. The results obtained by the Badri et al. [47] technique are presented in Fig. 4 (c), where the said technique detected background pixels as foreground pixels for various slow-moving image sequences. Fig. 4 (d) represents the detected outcomes obtained by the Zhu et al. [48] scheme, where the missed alarm rate is high. Fig. 4 (e) and Fig. 4 (f) denote the outcomes achieved by the Sahoo et al. [20] and Sahoo et al. [49] techniques, respectively, where a high false negative rate is observed. The outcomes attained by the developed model are illustrated in Fig. 4 (g), where the background and foreground pixels are classified accurately. Fig. 5 (a) and Fig. 5 (b) show the input images and the associated ground-truth frames. From Fig. 5 (g), it is evident that the developed technique accurately captures the moving object shape, demonstrating lower false negative and false positive rates than the Badri et al. [47], Zhu et al. [48], Sahoo et al. [20], and Sahoo et al. [49] techniques presented in Fig. 5 (c), (d), (e), and (f), respectively.
The change detection output is visually analyzed using seven sequences chosen from the CD-Net 2014 dataset. The challenging effects in these video scenes include low contrast, non-static backgrounds, low frame rates, noise, shadows, poor resolution, low signal-to-noise ratio, and lack of object shape and textural details in the images. The developed technique’s performance is visually compared with that of six established deep learning methods: BSUV-Net + SemanticBGS [50], BSUV-Net 2.0 [51], Cascaded CNN [52], DeepBS [53], Fast BSUV-Net 2.0 [51], and WisenetMD [54]. Fig. 6 (a) and (b) represent input images and their associated ground-truth frames, respectively. The object detection outcomes achieved by BSUV-Net + SemanticBGS are demonstrated in Fig. 6 (c), where it can be seen that the background is identified as the foreground. Fig. 6 (d) represents the outcomes of the BSUV-Net 2.0 [51] technique, where numerous false alarms are present in the target scene. The segmented outcome of the Cascaded CNN [52] method is showcased in Fig. 6 (e), where the said technique is unable to detect some details of the object in motion. Fig. 6 (f) shows outcomes of the DeepBS [53] method, where numerous edge pixels are absent due to imbalanced pixel values across various video frames, leading to a significant number of missed alarms in the detected outcomes. The outcomes of the Fast BSUV-Net 2.0 [51] technique are represented in Fig. 6 (g), where the mentioned technique incorrectly categorizes certain object pixels as background. Fig. 6 (h) represents the WisenetMD [54] algorithm’s results, where this method has difficulty discerning subtle variations in grey values, resulting in the generation of ghost artifacts. In contrast, the MOD-CVS, showcased in Fig. 6 (i), performs better than the existing SOTA techniques and precisely classifies background and foreground pixels. In complex video scenes, the developed technique can successfully determine the shapes of moving objects. The MOD-CVS is further tested on the Wallflower dataset, as shown in Fig. 7. The original frames and associated ground-truth images are represented in Fig. 7 (a) and (b), and Fig. 7 (c) illustrates the results attained by the developed technique. From Fig. 7 (c), it is found that the developed technique attains better results for the Wallflower dataset. Again, the MOD-CVS is validated on the Star dataset. The input frames and their corresponding ground-truth images are presented in Fig. 8 (a) and (b). Fig. 8 (c) illustrates the proposed method’s results, where it is noted that the MOD-CVS is capable of accurately classifying the foreground as well as background pixels with less noise.
Foreground segmentation for various sequences: (a) original frame, (b) ground-truth image, outcomes attained by the BGS techniques based on (c) BSUV-Net + SemanticBGS [50], (d) BSUV-Net 2.0 [51], (e) Cascaded CNN [52], (f) DeepBS [53], (g) Fast BSUV-Net 2.0 [51], (h) WisenetMD [54], and (i) MOD-CVS.
Foreground segmentation for various sequences: (a) original frame, (b) ground-truth image, (c) outcomes attained by MOD-CVS for the Wallflower dataset.
Foreground segmentation for various sequences: (a) original frame (b) ground-truth image, (c) outcomes attained by MOD-CVS for Star dataset.
C. Objective Analysis
To assess the efficacy of the developed technique, a quantitative comparison between the developed technique and the prevailing SOTA techniques for slow-moving objects, in terms of average F-measure (AF) and average misclassification error (AMCE), is outlined in Table 2 and Table 3. From these tables, it is found that the developed algorithm attains a higher AF with a lower AMCE than the Badri et al. [47], Zhu et al. [48], Sahoo et al. [20], and Sahoo et al. [49] methods.
To further justify the efficiency of the proposed algorithm, the developed model is tested on the CD-Net 2014 dataset with various challenging sequences, using the averages of Precision (AP), Recall (AR), F-measure (AF), and Percentage of Wrong Classification (APWC). The objective is to simultaneously reduce the percentage of wrong classification (PWC) and increase the F-measure, Precision, and Recall [59]. We compared the results obtained on the CD-Net 2014 dataset against eighteen existing BGS SOTA methods, including eight deep learning techniques: DeepBS [53], WisenetMD [54], Fast BSUV-Net 2.0 [51], SemanticBGS [60], BSUV-Net [50], BSUV-Net + SemanticBGS [50], IUTIS-5 [61], and BMN-BSN [62]. Table 4 shows that the proposed model achieves superior values of AP, AR, and AF while exhibiting a lower APWC compared to all the SOTA deep learning techniques. The MOD-CVS is also compared with ten non-deep-learning techniques: SWCD [63], CVABS [64], PAWCS [65], WiSARDrp [66], Multimode Background [67], BMOG [68], WeSAMBE [69], RT-SBS-v1 [70], M4CD Version 2.0 [71], and CL-VID [72]. Table 4 shows that the developed technique attains higher values of AP, AR, and AF, while also presenting a lower APWC, compared to the SOTA techniques that are not based on deep learning.
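For reference, the evaluation measures used above (Precision, Recall, F-measure, and PWC) can be computed from a predicted binary mask and its ground truth as in the following sketch; the small epsilon guarding against division by zero is an implementation convenience, not part of the metric definitions.

```python
# Per-frame change-detection metrics from binary masks.
import numpy as np

def change_detection_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)        # foreground correctly detected
    fp = np.sum(pred & ~gt)       # background wrongly marked as foreground
    fn = np.sum(~pred & gt)       # missed foreground
    tn = np.sum(~pred & ~gt)      # background correctly rejected
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f_measure = 2 * precision * recall / (precision + recall + 1e-12)
    pwc = 100.0 * (fp + fn) / (tp + fp + fn + tn)   # percentage of wrong classification
    return precision, recall, f_measure, pwc
```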
Further, to check the efficiency of the MOD-CVS, experiments have been conducted on the Star dataset, which consists of image sequences of challenging video scenes: noise in the video scene, non-static backgrounds, changes in lighting conditions, and shadow. We compared against five SOTA techniques: GMM [73], DPGMM [74], Feature bags [75], Video plane [13], and Self-organizing [76]. We employed the average similarity measure [76] to assess the effectiveness of the developed technique. The average similarity measure attained by the proposed approach compared to the different SOTA methods is shown in Table 5. The results in Table 5 indicate that the MOD-CVS exhibits a higher average similarity measure on the Star dataset than the other SOTA techniques considered.
Finally, to evaluate the efficacy of the developed model, the well-known Wallflower dataset is used for testing; it contains indoor and outdoor video scenes captured by a CCD camera with non-static backgrounds, illumination variations, and video noise. The effectiveness of the developed technique is validated through a comparative analysis with nine established SOTA techniques: Fuzzy Mode [77], ViBe [78], BRPCA [79], GMM [73], Codebook [80], DeepBS [53], Triplet CNN [81], MsEDNet [82], and STAM [83]. The evaluation metric employed for this database is AF. Analysis of Table 6 reveals that the proposed algorithm achieves the highest AF values compared to all the considered SOTA techniques.
D. Unseen Video Setup
In an unseen video arrangement, the training and testing sets contain different videos. The proposed framework is trained with the Claire, Mother daughter, and Grandma image sequences, while the Akiyo, Teleprompter, and Speech image frames are used for testing. Similarly, the model is trained using the Salesman, Teleprompter, and Speech image sequences, and the Miss and Suzie image frames are used for testing. From Table 7, it is observed that the designed model attains a better average F-measure for the unseen setup. Similarly, we have investigated the efficacy of the MOD-CVS in the unseen setup for the Wallflower and Star databases; Table 7 indicates that the developed model exhibits better AF values for the Wallflower and Star databases in the unseen configuration. Additionally, the effectiveness of the developed algorithm is assessed in unseen setups for the CD-Net 2014 dataset. As shown in Table 8, the proposed model demonstrates satisfactory accuracy compared to established BGS techniques. In this table, Bl, Pe, Sw, Bo, Pa, Tp, Ts, Bs, Co, and T1 denote blizzard (from BadWeather), pedestrian (from Baseline), sidewalk (from Camera Jitter), boats (from Dynamic Background), parking (from Intermittent Object Motion), turnpike05fps (from Low Framerate), tramstation (from Night Videos), busstation (from Shadow), corridor (from Thermal), and turbulence1 (from Turbulence), respectively.
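For clarity, the unseen-video protocol described above amounts to forming training and test sets from disjoint sequences. The sketch below illustrates one such split; load_sequence is a hypothetical loader returning (frames, masks) arrays and is not part of the original work.

```python
# Sketch of an unseen-video split: train and test sequences are disjoint.
import numpy as np

train_seqs = ["Claire", "Mother daughter", "Grandma"]
test_seqs = ["Akiyo", "Teleprompter", "Speech"]

def gather(seq_names):
    frames, masks = [], []
    for name in seq_names:
        f, m = load_sequence(name)   # hypothetical loader for (frames, masks)
        frames.append(f)
        masks.append(m)
    return np.concatenate(frames), np.concatenate(masks)

x_train, y_train = gather(train_seqs)
x_test, y_test = gather(test_seqs)
```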
E. Ablation Study
To analyze the importance of each element in the developed BGS deep-learning framework, an ablation study is performed. Table 9 demonstrates the efficacy of the developed algorithm with and without the GAP layer. It is found that the inclusion of the GAP layer in the proposed model consistently yields a higher AF value than the version without the GAP layer across all challenging videos. Likewise, an ablation study of the proposed architecture is conducted with and without the integration of the feature pooling framework (FPF). From Table 10, it is observed that the proposed algorithm with the FPF module attains higher accuracy than the version without the FPF module. The FPF module between the encoder and decoder effectively learns a mapping from a high-dimensional feature space to a multi-scale, multi-dimensional feature space.
Additionally, the ablation study culminated in a run-time comparison of the proposed approach against various SOTA techniques on the CDNet-2014 dataset. Table 11 reveals that the processing speed of the developed architecture is 21 frames per second, underscoring the comparatively lower computational complexity of the MOD-CVS with respect to many existing SOTA methods.
The proposed method is further tested with k-fold cross-validation. The performance of the MOD-CVS with k-fold cross-validation (k = 5 and k = 10) is outlined in Table 12. For k = 5, the entire set of 159,278 frames from the CDNet-2014 dataset is partitioned into 5 folds. Initially, the first fold is used for testing and the remaining folds for training; subsequently, the second fold serves as the testing set with the remaining folds employed for training, and this process continues iteratively. Similarly, for k = 10, the 159,278 frames of CDNet-2014 are divided into 10 folds, and testing is carried out with one fold while training is conducted with the remaining folds in a sequential manner. According to Table 12, the proposed approach with k-fold cross-validation achieves an average F-Measure of 0.8105 for k = 5 and 0.8157 for k = 10, whereas the developed MOD-CVS technique without the cross-validation training mechanism attains a higher average F-Measure of 0.8269.
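The following sketch illustrates the k-fold protocol described above using scikit-learn; build_model is a hypothetical factory returning a freshly compiled model, change_detection_metrics is the metric helper sketched earlier, and the per-fold epoch count is an illustrative assumption.

```python
# Sketch of k-fold cross-validation over a frame collection (k = 5 here).
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(frames, masks, build_model, k=5):
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=False).split(frames):
        model = build_model()                      # fresh model per fold (hypothetical factory)
        model.fit(frames[train_idx], masks[train_idx], batch_size=2, epochs=10)  # epochs assumed
        pred = model.predict(frames[test_idx]) > 0.5
        _, _, f_measure, _ = change_detection_metrics(pred, masks[test_idx])
        fold_scores.append(f_measure)
    return float(np.mean(fold_scores))             # average F-Measure across folds
```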
Further, the efficacy of the proposed MOD-CVS model is verified in Table 13, which compares the average F-Measure of the proposed MOD-CVS with that of various Swin Transformer-based methods. It is found that the proposed MOD-CVS method achieves a comparatively higher value than the existing methods. Furthermore, this consistency is demonstrated on the NJU2K [57], STERE [56], NLPR [58], and DUTS [55] datasets. Fig. 9 (a) and (b) depict the input frame and its corresponding ground-truth image, respectively. Fig. 9 (c) portrays the outcomes of the proposed method, emphasizing the MOD-CVS’s proficiency in precisely categorizing both foreground and background pixels.
Table 14 compares the average F-Measure of the autoencoder-based AE-NE [38] approach and the proposed MOD-CVS method on the CDNet-2014 dataset. It clearly demonstrates that the MOD-CVS exhibits a higher F-measure value than the alternative method.
Conclusion
This research work tackles the task of detecting moving objects in challenging video scenes by employing a deep-learning architecture with an encoder-decoder design. The proposed model detects moving objects in complex video scenes, including objects moving at different speeds, low contrast, non-static backgrounds, low frame rates, noise, images captured at night, shadows, poor resolution, low signal-to-noise ratio, and lack of object shape and textural details in the images. To extract diverse features accurately at multiple levels, we have used an improved version of the pre-trained VGG-19 deep learning network as the encoder. The transfer learning mechanism in the encoder network further enhances the efficacy of the MOD-CVS model, and the various layers in the proposed VGG-19 deep neural network are capable of preserving the low-, mid-, and high-level features that are essential for local change detection. The feature pooling framework (FPF) between the encoder and decoder networks efficiently preserves objects of various scales in challenging video frames. In the proposed algorithm, the FPF module effectively learns a mapping from a higher-dimensional feature space to a multi-scale, multi-dimensional feature space that allows the foreground and background pixels to be classified with simple decision boundaries. The decoder network in the MOD-CVS model contains a stack of convolutional layers that effectively project the feature space to the image space. The effectiveness of the MOD-CVS algorithm is corroborated using subjective and objective analysis against thirty-six SOTA techniques. It is observed that the MOD-CVS model retains the shape of the moving object accurately with fewer pores and holes compared to the SOTA techniques. The MOD-CVS also provides adequate accuracy for unseen video setups. However, the performance of the MOD-CVS is reduced when the moving object size is small, and the proposed work provides frontier outcomes when there is a higher variation in the scene. In the future, we aim to improve the accuracy of the MOD-CVS by investigating a robust hybridized deep neural architecture.